lukasz-antoniak commented on code in PR #168: URL: https://github.com/apache/cassandra-analytics/pull/168#discussion_r2812720602
########## docs/src/user.adoc: ##########
@@ -0,0 +1,450 @@

= Overview

This document describes the configuration options available for the bulk reader and bulk writer components.

== Cassandra Sidecar Configuration

The Cassandra Analytics library uses https://github.com/apache/cassandra-sidecar[Apache Cassandra Sidecar] to interact
with the target cluster. The bulk reader and bulk writer components share the following common Sidecar configuration
properties.

[cols="2,1,1,3"]
|===
|Property name|Required|Default|Description

|_sidecar_contact_points_
|yes
|
|Comma-separated list of Cassandra Sidecar contact points. IP addresses and FQDN domain names are supported,
with an optional port number (e.g. `localhost1,localhost2`, `127.0.0.1,127.0.0.2`, `127.0.0.1:9043,127.0.0.2:9043`)

|_sidecar_port_
|no
|`9043`
|Default port on which Cassandra Sidecar listens

|_keystore_path_
|no
|
|Path to the keystore used to establish a TLS connection with Cassandra Sidecar

|_keystore_base64_encoded_
|no
|
|Base64-encoded keystore used to establish a TLS connection with Cassandra Sidecar

|_keystore_password_
|no
|
|Keystore password

|_keystore_type_
|no
|`PKCS12`
|Keystore type, `PKCS12` or `JKS`

|_truststore_path_
|no
|
|Path to the truststore used to establish a TLS connection with Cassandra Sidecar

|_truststore_base64_encoded_
|no
|
|Base64-encoded truststore used to establish a TLS connection with Cassandra Sidecar

|_truststore_password_
|no
|
|Truststore password

|_truststore_type_
|no
|`PKCS12`
|Truststore type, `PKCS12` or `JKS`

|_cassandra_role_
|no
|
|Specific role that Sidecar shall use to authorize the request. For further details, consult the Sidecar documentation
for the `cassandra-auth-role` HTTP header

|===
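A minimal sketch of how these shared Sidecar properties might be collected into a single Spark option map and reused
by both components. All host names, paths, ports and secrets below are illustrative placeholders, not values from
this PR.

[source,scala]
----
// Shared Sidecar connection options, reused later by both the bulk reader and the bulk writer.
// Host names, port, keystore/truststore paths and passwords are placeholders only.
val sidecarOptions: Map[String, String] = Map(
  "sidecar_contact_points" -> "sidecar-1.example.com,sidecar-2.example.com",
  "sidecar_port"           -> "9043",
  // TLS settings are optional; omit them when connecting without TLS.
  "keystore_path"          -> "/etc/ssl/client-keystore.p12",
  "keystore_password"      -> sys.env.getOrElse("KEYSTORE_PASSWORD", ""),
  "keystore_type"          -> "PKCS12",
  "truststore_path"        -> "/etc/ssl/truststore.p12",
  "truststore_password"    -> sys.env.getOrElse("TRUSTSTORE_PASSWORD", ""),
  "truststore_type"        -> "PKCS12"
)
----

The map can then be passed verbatim to `DataFrameReader.options(...)` or `DataFrameWriter.options(...)`, as the
sketches in the reader and writer sections below assume.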
== Bulk Reader

This section describes configuration properties specific to the bulk reader.

=== Cassandra Sidecar Configuration

[cols="2,1,1,3"]
|===
|Property name|Required|Default|Description

|_defaultMillisToSleep_
|no
|`500`
|Number of milliseconds to wait between retry attempts

|_maxMillisToSleep_
|no
|`60000`
|Maximum number of milliseconds to sleep between retries

|_maxPoolSize_
|no
|`64`
|Size of the Vert.x worker thread pool

|_timeoutSeconds_
|no
|`600`
|Request timeout, expressed in seconds

|===

=== Spark Reader Configuration

[cols="2,1,1,3"]
|===
|Property name|Required|Default|Description

|_keyspace_
|yes
|
|Keyspace of the table to read

|_table_
|yes
|
|Table to be read

|_dc_
|no
|
|Data center used when a `LOCAL_*` consistency level is specified

|_consistencyLevel_
|no
|`LOCAL_QUORUM`
|Read consistency level

|_snapshotName_
|no
|`sbr_\{uuid\}`
|Name of the snapshot to use (for data consistency). By default, a unique name is always generated

|_createSnapshot_
|no
|`true`
|Indicates whether a new snapshot should be created prior to performing the read operation

|_clearSnapshotStrategy_
|no
|`OnCompletionOrTTL 2d`
a|Strategy for removing the snapshot once the read operation completes. This option is always enabled when the
_createSnapshot_ flag is set to `true`. The value of _clearSnapshotStrategy_ must follow the format
`[strategy] [snapshotTTL]`.

Supported strategies: `NoOp`, `OnCompletion`, `OnCompletionOrTTL`, `TTL`.

The TTL value has to match the pattern `\d+(d\|h\|m\|s)`.

Example configurations: `OnCompletionOrTTL 2d`, `TTL 2d`, `NoOp`, `OnCompletion`.

|_bigNumberConfig_
|no
|
a|Defines the output scale and precision of `decimal` and `varint` columns. The parameter value is a JSON string
with the following structure:

[source,json]
----
{
  "column_name_1" : {"bigDecimalPrecision": 10, "bigDecimalScale": 5},
  "column_name_2" : {"bigIntegerPrecision": 10, "bigIntegerScale": 5}
}
----

|_lastModifiedColumnName_
|no
|
|Name of the field appended to the Spark RDD that represents the last modification timestamp of each row

|===

=== Other Properties

[cols="2,1,1,3"]
|===
|Property name|Required|Default|Description

|_defaultParallelism_
|recommended
|`1`
|Value of the Spark property `spark.default.parallelism`

|_numCores_
|recommended
|`1`
|Total number of cores used by all Spark executors

|_maxBufferSizeBytes_
|no
|`6291456`
a|Maximum number of bytes per sstable file that may be downloaded and buffered in memory. This parameter is a
global default and can be overridden per sstable file type. Effective defaults are:

- `Data.db`: 6291456
- `Index.db`: 131072
- `Summary.db`: 262144
- `Statistics.db`: 131072
- `CompressionInfo.db`: 131072
- `.log` (commit log): 65536
- `Partitions.db`: 131072
- `Rows.db`: 131072

To override the size for `Data.db`, use the property `_maxBufferSizeBytes_Data.db_`.

|_chunkBufferSizeBytes_
|no
|`4194304`
a|Default chunk size (in bytes) requested when fetching the next portion of an sstable file. This parameter is a
global default and can be overridden per sstable file type. Effective defaults are:

- `Data.db`: 4194304
- `Index.db`: 32768
- `Summary.db`: 131072
- `Statistics.db`: 65536
- `CompressionInfo.db`: 65536
- `.log` (commit log): 65536
- `Partitions.db`: 4096
- `Rows.db`: 4096

To override the size for `Data.db`, use the property `_chunkBufferSizeBytes_Data.db_`.

|_sizing_
|no
|`default`
a|Determines how the number of CPU cores is selected during the read operation. Supported options:

* `default`: static number of cores defined by the _numCores_ parameter
* `dynamic`: calculates the number of cores dynamically based on table size. Improves cost efficiency for processing
small tables (a few GBs). Consult the JavaDoc of `org.apache.cassandra.spark.data.DynamicSizing` for implementation
details. Relevant configuration properties:
** _maxPartitionSize_: maximum Spark partition size (in GiB)

|_quote_identifiers_
|no
|`false`
|When `true`, keyspace, table and column names are quoted

|_sstable_start_timestamp_micros_ and _sstable_end_timestamp_micros_
|no
|
|Define an inclusive time-range filter for sstable selection. Both timestamps are expressed in microseconds

|===
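A hedged sketch of a bulk read using the properties above. The data source name
`org.apache.cassandra.spark.sparksql.CassandraDataSource` is assumed here, and the keyspace, table, contact point and
data-center names are placeholders; verify them against the Analytics version in use.

[source,scala]
----
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("bulk-reader-example").getOrCreate()

val snapshot: DataFrame = spark.read
  .format("org.apache.cassandra.spark.sparksql.CassandraDataSource") // assumed data source name
  .option("sidecar_contact_points", "sidecar-1.example.com")         // placeholder contact point
  .option("keyspace", "demo_ks")                                     // placeholder keyspace
  .option("table", "demo_table")                                     // placeholder table
  .option("consistencyLevel", "LOCAL_QUORUM")
  .option("dc", "dc1")                                               // needed for LOCAL_* consistency levels
  .option("clearSnapshotStrategy", "OnCompletionOrTTL 2d")
  .option("lastModifiedColumnName", "last_modified_at")              // optional extra timestamp column
  .load()

snapshot.show(10)
----

The shared `sidecarOptions` map from the earlier sketch could equally be supplied via `.options(sidecarOptions)`.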
== Bulk Writer

This section describes configuration properties specific to the bulk writer.

=== Spark Writer Configuration

[cols="2,1,1,3"]
|===
|Property name|Required|Default|Description

|_keyspace_
|yes
|
|Keyspace of the table to write to

|_table_
|yes
|
|Table to which rows are written, or from which rows are removed, depending on _write_mode_

|_local_dc_
|no
|
|Data center used when a `LOCAL_*` consistency level is specified

|_bulk_writer_cl_
|no
|`EACH_QUORUM`
|Write consistency level

|_write_mode_
|no
|`INSERT`
a|Determines the write mode:

* `INSERT`: Writes new rows to the table. Generated sstables contain the data to be inserted
* `DELETE_PARTITION`: Removes entire partitions from the table. Only partition key columns are required in the input data

|_ttl_
|no
|
|Time-to-live value (in seconds) applied to created records. When specified, all inserted rows will expire after the
given duration. Only applicable in `INSERT` mode. Example: `86400` for a 1-day TTL

|_timestamp_
|no
|`NOW`
a|Mutation timestamp assigned to generated rows, expressed in microseconds. Options:

* `NOW`: Uses the current system time at write execution
* Custom value: Specify an exact timestamp in microseconds (e.g., `1609459200000000` for 2021-01-01 00:00:00 UTC)

Custom timestamps affect conflict resolution in Cassandra (last-write-wins)

|_skip_extended_verify_
|no
|`false`
|Every imported sstable is verified for corruption during the import process. Setting this property to `true` skips
the extended verification of all values in the new sstables

|_quote_identifiers_
|no
|`false`
|Specifies whether identifiers (i.e. keyspace, table and column names) should be quoted, to support mixed-case and
reserved-keyword names for these fields

|_data_transport_
|no
|`DIRECT`
a|Specifies the data transport mode. Supported implementations:

* `DIRECT`: Uploads generated sstables directly to the Cassandra cluster via Sidecar
* `S3_COMPAT`: Uploads generated sstables to multiple remote Cassandra clusters with intermediate S3 storage
(see <<Multi-cluster Upload Properties>>)

|===

Review Comment:
   Thank you for clarification.
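A hedged sketch of a bulk insert using the writer options above. The sink name
`org.apache.cassandra.spark.sparksql.CassandraDataSink` is assumed, and the Sidecar host, keyspace and table names
are placeholders.

[source,scala]
----
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().appName("bulk-writer-example").getOrCreate()

// Input rows; column names must match the target table schema.
val rows: DataFrame = spark.createDataFrame(Seq(
  (1L, "alice"),
  (2L, "bob")
)).toDF("id", "name")

rows.write
  .format("org.apache.cassandra.spark.sparksql.CassandraDataSink") // assumed sink name
  .option("sidecar_contact_points", "sidecar-1.example.com")       // placeholder contact point
  .option("keyspace", "demo_ks")                                   // placeholder keyspace
  .option("table", "demo_table")                                   // placeholder table
  .option("bulk_writer_cl", "LOCAL_QUORUM")
  .option("local_dc", "dc1")
  .option("write_mode", "INSERT")
  .option("ttl", "86400")                                          // optional: rows expire after one day
  .mode(SaveMode.Append)
  .save()
----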
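For the `DELETE_PARTITION` write mode, only partition key columns are expected in the input. A hedged sketch, again
with an assumed sink name and placeholder identifiers:

[source,scala]
----
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("partition-delete-example").getOrCreate()

// Only the partition key column(s) of the target table are provided.
val partitionsToDelete = spark.createDataFrame(Seq(Tuple1(42L), Tuple1(43L))).toDF("id")

partitionsToDelete.write
  .format("org.apache.cassandra.spark.sparksql.CassandraDataSink") // assumed sink name
  .option("sidecar_contact_points", "sidecar-1.example.com")       // placeholder contact point
  .option("keyspace", "demo_ks")                                   // placeholder keyspace
  .option("table", "demo_table")                                   // placeholder table
  .option("write_mode", "DELETE_PARTITION")
  .mode(SaveMode.Append)
  .save()
----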
########## docs/src/user.adoc: ##########
@@ -0,0 +1,450 @@

=== Multi-cluster Upload Properties

Cassandra Analytics can import the same set of generated sstables into multiple Cassandra clusters running in remote
locations. The Analytics library uploads the generated sstables to a common S3 storage. The S3 service replicates the
data across regions and triggers the import of the files into each Cassandra cluster using the local Sidecar
instances.
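A hedged sketch of a coordinated, S3-based write; the individual properties are described in the table that follows.
The sink name, the `com.example.MyStorageTransportExtension` class and all host names are illustrative assumptions,
not values shipped with the library.

[source,scala]
----
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("coordinated-write-example").getOrCreate()
val rows  = spark.createDataFrame(Seq((1L, "alice"), (2L, "bob"))).toDF("id", "name")

// JSON value of coordinated_write_config, listing the remote clusters and their Sidecar contact points.
val coordinatedWriteConfig =
  """{
    |  "cluster1": {"sidecarContactPoints": ["instance-1:9999"], "localDc": "dc1", "writeToLocalDcOnly": false},
    |  "cluster2": {"sidecarContactPoints": ["instance-4:8888"], "localDc": "dc2", "writeToLocalDcOnly": false}
    |}""".stripMargin

rows.write
  .format("org.apache.cassandra.spark.sparksql.CassandraDataSink")                     // assumed sink name
  .option("keyspace", "demo_ks")                                                       // placeholder keyspace
  .option("table", "demo_table")                                                       // placeholder table
  .option("data_transport", "S3_COMPAT")
  .option("coordinated_write_config", coordinatedWriteConfig)
  .option("data_transport_extension_class", "com.example.MyStorageTransportExtension") // hypothetical class
  .mode(SaveMode.Append)
  .save()
----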
[cols="2,1,1,3"]
|===
|Property name|Required|Default|Description

|_coordinated_write_config_
|yes
|
a|Configuration of the coordinated write operation, in JSON format. It lists all remote Cassandra clusters to write
to, together with the list of local Sidecar instances.

Example:

[source,json]
----
{
  "cluster1": {
    "sidecarContactPoints": [
      "instance-1:9999",
      "instance-2:9999",
      "instance-3:9999"
    ],
    "localDc": "dc1",
    "writeToLocalDcOnly": false
  },
  "cluster2": {
    "sidecarContactPoints": [
      "instance-4:8888"
    ],
    "localDc": "dc2",
    "writeToLocalDcOnly": false
  }
}
----

|_data_transport_extension_class_
|yes
|
|Fully qualified name of a class that implements the `StorageTransportExtension` interface. Consult its JavaDoc for
implementation details

|_storage_client_endpoint_override_
|no
|
|Overrides the S3 endpoint

|_storage_client_https_proxy_
|no
|
|HTTPS proxy for the S3 client

|_max_size_per_sstable_bundle_in_bytes_s3_transport_
|no
|`5368709120`
|Limits the maximum size of an uploaded S3 object

|_storage_client_max_chunk_size_in_bytes_
|no
|`104857600`
|Specifies the maximum chunk size for multipart S3 uploads

|_storage_client_concurrency_
|no
|`CPU cores * 2`
|Controls the maximum parallelism of the thread pool used by the S3 client

|_storage_client_thread_keep_alive_seconds_
|no
|`60`
|Idle storage thread timeout, in seconds

|_storage_client_nio_http_client_connection_acquisition_timeout_seconds_
|no
|`300`
|Connection acquisition timeout for the NIO HTTP component employed by the S3 client

|_storage_client_nio_http_client_max_concurrency_
|no
|`50`
|Specifies the concurrency of the NIO HTTP component employed by the S3 client

Review Comment:
   Done.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
