skoppu22 commented on code in PR #168:
URL: https://github.com/apache/cassandra-analytics/pull/168#discussion_r2789300662
##########
docs/src/user.adoc:
##########
@@ -0,0 +1,318 @@
+= Overview
+
+This document describes the configuration options available for the bulk reader and bulk writer components.
+
+== Cassandra Sidecar Configuration
+
+The Analytics library uses Sidecar to interact with the Cassandra cluster. The bulk reader
+and writer components share common Sidecar configuration properties.
+
+[cols="1,1,2"]
+|===
+|Property name|Default|Description
+
+|_sidecar_contact_points_
+|
+|Comma-separated list of Cassandra Sidecar contact points. IP addresses and fully qualified
+domain names are supported, with an optional port number (e.g. `localhost1,localhost2`,
+`127.0.0.1,127.0.0.2`, `127.0.0.1:9043,127.0.0.2:9043`)
+
+|_sidecar_port_
+|`9043`
+|Default port on which Cassandra Sidecar listens
+
+|_keystore_path_
+|
+|Path to keystore used to establish TLS connection with Cassandra Sidecar
+
+|_keystore_base64_encoded_
+|
+|Base64-encoded keystore used to establish TLS connection with Cassandra Sidecar
+
+|_keystore_password_
+|
+|Keystore password
+
+|_keystore_type_
+|`PKCS12`
+|Keystore type, `PKCS12` or `JKS`
+
+|_truststore_path_
+|
+|Path to truststore used to establish TLS connection with Cassandra Sidecar
+
+|_truststore_base64_encoded_
+|
+|Base64-encoded truststore used to establish TLS connection with Cassandra Sidecar
+
+|_truststore_password_
+|
+|Truststore password
+
+|_truststore_type_
+|`PKCS12`
+|Truststore type, `PKCS12` or `JKS`
+
+|_cassandra_role_
+|
+|Specific role that Sidecar shall use to authorize the request. For further details, consult
+the Sidecar documentation for the `cassandra-auth-role` HTTP header
+
+|===
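+
+These properties are supplied as options on the Spark reader or writer. The following is a
+minimal sketch, assuming the `org.apache.cassandra.spark.sparksql.CassandraDataSource`
+format name from the project examples; the keyspace, table, keystore paths and environment
+variables shown are illustrative:
+
+[source,scala]
+----
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder()
+  .appName("analytics-sidecar-example")
+  .getOrCreate()
+
+// Shared Sidecar options: contact points accept IPs or FQDNs, optionally with a port.
+val df = spark.read
+  .format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
+  .option("sidecar_contact_points", "127.0.0.1:9043,127.0.0.2:9043")
+  // mTLS material for the Sidecar connection (paths and secrets are illustrative)
+  .option("keystore_path", "/path/to/keystore.p12")
+  .option("keystore_password", sys.env("KEYSTORE_PASSWORD"))
+  .option("keystore_type", "PKCS12")
+  .option("truststore_path", "/path/to/truststore.p12")
+  .option("truststore_password", sys.env("TRUSTSTORE_PASSWORD"))
+  .option("keyspace", "my_keyspace")
+  .option("table", "my_table")
+  .load()
+----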
+
+== Bulk Reader
+
+This section describes configuration properties specific to the bulk reader.
+
+=== Cassandra Sidecar Configuration
+
+[cols="1,1,2"]
+|===
+|Property name|Default|Description
+
+|_defaultMillisToSleep_
+|`500`
+|Number of milliseconds to wait between retry attempts
+
+|_maxMillisToSleep_
+|`60000`
+|Maximum number of milliseconds to sleep between retries
+
+|_maxPoolSize_
+|`64`
+|Size of the Vert.x worker thread pool
+
+|_timeoutSeconds_
+|`600`
+|Request timeout, expressed in seconds
+
+|===
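+
+As an illustration, the client knobs above can be tuned alongside the shared Sidecar
+options. This sketch reuses the `spark` session and `CassandraDataSource` assumption from
+the earlier example; the values are arbitrary:
+
+[source,scala]
+----
+// Sketch: relaxing retry and timeout behaviour for a long-running read.
+val df = spark.read
+  .format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
+  .option("sidecar_contact_points", "localhost")
+  .option("keyspace", "my_keyspace")       // illustrative names
+  .option("table", "my_table")
+  .option("defaultMillisToSleep", "1000")  // initial back-off between retries
+  .option("maxMillisToSleep", "120000")    // upper bound on the back-off
+  .option("maxPoolSize", "128")            // Vert.x worker thread pool size
+  .option("timeoutSeconds", "900")         // per-request timeout
+  .load()
+----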
+
+=== Spark Reader Configuration
+
+[cols="1,1,2"]
+|===
+|Property name|Default|Description
+
+|_keyspace_
+|
+|Keyspace of the table to read
+
+|_table_
+|
+|Table to be read
+
+|_dc_
+|
+|Data center used when `LOCAL_*` consistency level is specified
+
+|_consistencyLevel_
+|`LOCAL_QUORUM`
+|Read consistency level
+
+|_snapshotName_
+|`sbr_\{uuid\}`
+|Name of the snapshot to use (for data consistency). By default, a unique name is generated
+
+|_createSnapshot_
+|`true`
+|Indicates whether a new snapshot should be created prior to performing the read operation
+
+|_clearSnapshotStrategy_
+|`OnCompletionOrTTL 2d`
+|Strategy for removing the snapshot once the read operation completes. This option always
+applies when the _createSnapshot_ flag is set to `true`. The value of _clearSnapshotStrategy_
+must follow the format `[strategy] [snapshotTTL]`. Supported strategies: `NoOp`,
+`OnCompletion`, `OnCompletionOrTTL`, `TTL`. Example configurations: `OnCompletionOrTTL 2d`,
+`TTL 2d`, `NoOp`, `OnCompletion`. The TTL value has to match the pattern `\d+(d\|h\|m\|s)`
+
+|_bigNumberConfig_
+|
+a|Defines the output scale and precision of `decimal` and `varint` columns. The parameter
+value is a JSON string with the following structure:
+
+[source,json]
+----
+{
+ "columnName1" : {"bigDecimalPrecision": 10, "bigDecimalScale": 5},
+ "columnName2" : {"bigIntegerPrecision": 10, "bigIntegerScale": 5}
+}
+----
+
+|_lastModifiedColumnName_
+|
+|Name of the field appended to the Spark RDD that represents the last-modification
+timestamp of each row
+
+|===
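+
+Putting the reader options together, a hedged end-to-end sketch (reusing the `spark`
+session and `CassandraDataSource` assumption from above; all names are illustrative):
+
+[source,scala]
+----
+// Sketch: a LOCAL_QUORUM bulk read with an explicit snapshot policy and a
+// last-modified timestamp column appended to the resulting DataFrame.
+val df = spark.read
+  .format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
+  .option("sidecar_contact_points", "localhost")
+  .option("keyspace", "my_keyspace")
+  .option("table", "my_table")
+  .option("dc", "DC1")                                     // required for LOCAL_* levels
+  .option("consistencyLevel", "LOCAL_QUORUM")
+  .option("createSnapshot", "true")
+  .option("clearSnapshotStrategy", "OnCompletionOrTTL 2d") // [strategy] [snapshotTTL]
+  .option("lastModifiedColumnName", "last_modified")
+  .load()
+
+df.select("last_modified").show(5)
+----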
+
+=== Other Properties
+
+[cols="1,1,2"]
+|===
+|Property name|Default|Description
+
+|_defaultParallelism_
+|`1`
+|Value of Spark property `spark.default.parallelism`
+
+|_numCores_
+|`1`
+|Total number of cores used by all Spark executors
+
+|_maxBufferSizeBytes_
+|`6291456`
+a|Maximum number of bytes per sstable file that may be downloaded and buffered in memory.
+This parameter is the global default and can be overridden per sstable file type. Effective
+defaults are:
+
+- `Data.db`: 6291456
+- `Index.db`: 131072
+- `Summary.db`: 262144
+- `Statistics.db`: 131072
+- `CompressionInfo.db`: 131072
+- `.log` (commit log): 65536
+- `Partitions.db`: 131072
+- `Rows.db`: 131072
+
+To override the size for `Data.db`, use the property `_maxBufferSizeBytes_Data.db_` (see the example after this table).
+
+|_chunkBufferSizeBytes_
+|`4194304`
+a|Default chunk size (in bytes) requested when fetching the next portion of an sstable file.
+This parameter is the global default and can be overridden per sstable file type. Effective
+defaults are:
+
+- `Data.db`: 4194304
+- `Index.db`: 32768
+- `Summary.db`: 131072
+- `Statistics.db`: 65536
+- `CompressionInfo.db`: 65536
+- `.log` (commit log): 65536
+- `Partitions.db`: 4096
+- `Rows.db`: 4096
+
+To override the size for `Data.db`, use the property `_chunkBufferSizeBytes_Data.db_`.
+
+|_sizing_
+|`default`
+a|Determines how the number of CPU cores is selected during the read operation. Supported options:
+
+* `default`: static number of cores defined by the _numCores_ parameter
+* `dynamic`: calculates the number of cores dynamically based on table size. This improves
+cost efficiency when processing small tables (a few GBs). Consult the JavaDoc of
+`org.apache.cassandra.spark.data.DynamicSizing` for implementation details.
+Relevant configuration properties:
+ ** _maxPartitionSize_: maximum Spark partition size (in GiB)
+
+|_quote_identifiers_
+|`false`
+|When `true`, keyspace, table and column names are quoted
+
+|_sstable_start_timestamp_micros_ and _sstable_end_timestamp_micros_
+|
+|Define an inclusive time-range filter for sstable selection. Both timestamps are expressed in microseconds
+
+|===
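+
+For example, the per-file-type override naming and dynamic sizing described above could be
+combined as follows (a sketch under the same assumptions as the earlier examples; the sizes
+are arbitrary):
+
+[source,scala]
+----
+// Sketch: raising the buffer defaults for Data.db only and letting the reader
+// pick the number of cores based on table size.
+val df = spark.read
+  .format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
+  .option("sidecar_contact_points", "localhost")
+  .option("keyspace", "my_keyspace")
+  .option("table", "my_table")
+  .option("maxBufferSizeBytes", "8388608")           // global default, 8 MiB
+  .option("maxBufferSizeBytes_Data.db", "16777216")  // Data.db override, 16 MiB
+  .option("chunkBufferSizeBytes_Data.db", "8388608") // Data.db chunk override, 8 MiB
+  .option("sizing", "dynamic")
+  .option("maxPartitionSize", "2")                   // GiB per Spark partition
+  .load()
+----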
+
+== Bulk Writer
+
+This section describes configuration properties specific to the bulk writer.
+
+=== Spark Writer Configuration
+
+[cols="1,1,2"]
+|===
+|Property name|Default|Description
+
+|_keyspace_
+|
+|Keyspace of the table to write
+
+|_table_
+|
+|Table to which rows are written or from which rows are removed, depending on _write_mode_
+
+|_local_dc_
+|
+|Data center used when `LOCAL_*` consistency level is specified
+
+|_bulk_writer_cl_
+|`EACH_QUORUM`
+|Write consistency level
+
+|_write_mode_
+|`INSERT`
+a|Determines write mode:
+
+* `INSERT`: Writes new rows to the table. Generated sstables contain the data to be inserted
+* `DELETE_PARTITION`: Removes entire partitions from the table. Only partition key columns
+are required in the input data
+
+|_ttl_
+|
+|Time-to-live value (in seconds) applied to created records. When specified, all inserted
+rows will expire after the given duration. Only applicable in `INSERT` mode. Example:
+`86400` for a one-day TTL
+
+|_timestamp_
+|`NOW`
+a|Mutation timestamp assigned to generated rows, expressed in microseconds. Options:
+
+* `NOW`: Uses the current system time at write execution
+* Custom value: Specify an exact timestamp in microseconds (e.g., `1609459200000000` for
+2021-01-01 00:00:00 UTC)
+
+Custom timestamps affect conflict resolution in Cassandra (last-write-wins)
+
+|_skip_extended_verify_
+|`false`
+|Every imported sstable is verified for corruption during the import process. Setting this
+property to `true` skips the additional, extended verification of all values in the new
+sstables
+
+|_quote_identifiers_
+|`false`
+|Specifies whether identifiers (i.e. keyspace, table name, column names) should be quoted
+to support mixed-case and reserved-keyword names for these fields
+
+|_data_transport_
+|`DIRECT`
+a|Specifies data transport mode. Supported implementations:
+
+* `DIRECT`: Uploads generated sstables directly to the Cassandra cluster via Sidecar
+* `S3_COMPAT`: Uploads generated sstables to remote S3-compatible storage
+
+|===
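+
+A hedged writer sketch, assuming the `org.apache.cassandra.spark.sparksql.CassandraDataSink`
+format name from the project examples and an existing DataFrame `df` whose schema matches
+the target table:
+
+[source,scala]
+----
+// Sketch: bulk-writing df over the DIRECT transport with a one-day TTL.
+df.write
+  .format("org.apache.cassandra.spark.sparksql.CassandraDataSink")
+  .option("sidecar_contact_points", "localhost")
+  .option("keyspace", "my_keyspace")        // illustrative names
+  .option("table", "my_table")
+  .option("local_dc", "DC1")
+  .option("bulk_writer_cl", "LOCAL_QUORUM")
+  .option("write_mode", "INSERT")
+  .option("ttl", "86400")                   // rows expire after one day
+  .option("data_transport", "DIRECT")
+  .mode("append")
+  .save()
+----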
+
+=== S3 Upload Properties
+
+[cols="1,1,2"]
+|===
+|Property name|Default|Description
Review Comment:
This table is empty. @yifan-c are there any to fill here or should we just remove this table?