This is an automated email from the ASF dual-hosted git repository.
lukasz pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra-analytics.git
The following commit(s) were added to refs/heads/trunk by this push:
new cb74f9f6 CASSANALYTICS-6: User documentation
cb74f9f6 is described below
commit cb74f9f62677465e120f15e1be4d5250935535a5
Author: Lukasz Antoniak <[email protected]>
AuthorDate: Tue Feb 10 15:29:49 2026 +0100
CASSANALYTICS-6: User documentation
Patch by Lukasz Antoniak; Reviewed by Yifan Cai for CASSANALYTICS-6
---
build.gradle | 2 +
docs/build.gradle | 28 ++++
docs/src/user.adoc | 457 +++++++++++++++++++++++++++++++++++++++++++++++++++++
settings.gradle | 3 +-
4 files changed, 489 insertions(+), 1 deletion(-)
diff --git a/build.gradle b/build.gradle
index 396806a8..c556f994 100644
--- a/build.gradle
+++ b/build.gradle
@@ -32,6 +32,8 @@ plugins {
// Release Audit Tool (RAT) plugin for checking project licenses
id("org.nosphere.apache.rat") version "0.8.1"
+
+ id 'org.asciidoctor.jvm.convert' version '3.3.2'
}
repositories {
diff --git a/docs/build.gradle b/docs/build.gradle
new file mode 100644
index 00000000..e11f3703
--- /dev/null
+++ b/docs/build.gradle
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+apply plugin: 'org.asciidoctor.jvm.convert'
+
+asciidoctor {
+ sourceDir = file("src")
+ outputDir = file("build")
+ attributes(
+ 'project-version': project.version
+ )
+}
diff --git a/docs/src/user.adoc b/docs/src/user.adoc
new file mode 100644
index 00000000..57539155
--- /dev/null
+++ b/docs/src/user.adoc
@@ -0,0 +1,457 @@
+= Overview
+
+This document describes the configuration options available for the bulk reader and bulk writer components.
+
+== Cassandra Sidecar Configuration
+
+The Cassandra Analytics library uses https://github.com/apache/cassandra-sidecar[Apache Cassandra Sidecar] to interact
+with the target cluster. The bulk reader and writer components share common Sidecar configuration properties.
+
+[cols="2,1,1,3"]
+|===
+|Property name|Required|Default|Description
+
+|_sidecar_contact_points_
+|yes
+|
+|Comma-separated list of Cassandra Sidecar contact points. IP addresses and fully qualified domain names are supported,
+with an optional port number (e.g. `localhost1,localhost2`, `127.0.0.1,127.0.0.2`, `127.0.0.1:9043,127.0.0.2:9043`)
+
+|_sidecar_port_
+|no
+|`9043`
+|Default port on which Cassandra Sidecar listens
+
+|_keystore_path_
+|no
+|
+|Path to keystore used to establish TLS connection with Cassandra Sidecar
+
+|_keystore_base64_encoded_
+|no
+|
+|Base64-encoded keystore used to establish TLS connection with Cassandra Sidecar
+
+|_keystore_password_
+|no
+|
+|Keystore password
+
+|_keystore_type_
+|no
+|`PKCS12`
+|Keystore type, `PKCS12` or `JKS`
+
+|_truststore_path_
+|no
+|
+|Path to truststore used to establish TLS connection with Cassandra Sidecar
+
+|_truststore_base64_encoded_
+|no
+|
+|Base64-encoded truststore used to establish TLS connection with Cassandra Sidecar
+
+|_truststore_password_
+|no
+|
+|Truststore password
+
+|_truststore_type_
+|no
+|`PKCS12`
+|Truststore type, `PKCS12` or `JKS`
+
+|_cassandra_role_
+|no
+|
+|Specific role that Sidecar shall use to authorize the request. For further details, consult the Sidecar documentation
+for the `cassandra-auth-role` HTTP header
+
+|===
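The contact-point format above, combined with the `sidecar_port` fallback, can be sketched as follows. Note that `resolve_contact_points` is a hypothetical helper for illustration, not part of the library:

```python
# Hypothetical helper illustrating how `sidecar_contact_points` entries with an
# optional port could resolve against the default `sidecar_port` (9043). This
# is a sketch of the documented format, not the library's actual parsing code.
def resolve_contact_points(contact_points: str, default_port: int = 9043):
    """Split a comma-separated contact point list into (host, port) pairs."""
    resolved = []
    for entry in contact_points.split(","):
        host, sep, port = entry.strip().rpartition(":")
        if sep:  # explicit port given, e.g. "127.0.0.1:9043"
            resolved.append((host, int(port)))
        else:    # no port supplied, fall back to the default Sidecar port
            resolved.append((entry.strip(), default_port))
    return resolved
```

For example, `resolve_contact_points("localhost1,localhost2")` yields both hosts paired with port 9043.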
+
+== Bulk Reader
+
+This section describes configuration properties specific to the bulk reader.
+
+=== Cassandra Sidecar Configuration
+
+[cols="2,1,1,3"]
+|===
+|Property name|Required|Default|Description
+
+|_defaultMillisToSleep_
+|no
+|`500`
+|Number of milliseconds to wait between retry attempts
+
+|_maxMillisToSleep_
+|no
+|`60000`
+|Maximum number of milliseconds to sleep between retries
+
+|_maxPoolSize_
+|no
+|`64`
+|Size of the Vert.x worker thread pool
+
+|_timeoutSeconds_
+|no
+|`600`
+|Request timeout, expressed in seconds
+
+|===
+
+=== Spark Reader Configuration
+
+[cols="2,1,1,3"]
+|===
+|Property name|Required|Default|Description
+
+|_keyspace_
+|yes
+|
+|Keyspace of a table to read
+
+|_table_
+|yes
+|
+|Table to be read
+
+|_dc_
+|no
+|
+|Data center used when `LOCAL_*` consistency level is specified
+
+|_consistencyLevel_
+|no
+|`LOCAL_QUORUM`
+|Read consistency level
+
+|_snapshotName_
+|no
+|`sbr_\{uuid\}`
+|Name of the snapshot to use (for data consistency). By default, a unique name is always generated
+
+|_createSnapshot_
+|no
+|`true`
+|Indicates whether a new snapshot should be created prior to performing the read operation
+
+|_clearSnapshotStrategy_
+|no
+|`OnCompletionOrTTL 2d`
+a|Strategy for removing the snapshot once the read operation completes. This option is always enabled when the
+_createSnapshot_ flag is set to `true`. The value of _clearSnapshotStrategy_ must follow the format `[strategy] [snapshotTTL]`.
+
+Supported strategies: `NoOp`, `OnCompletion`, `OnCompletionOrTTL`, `TTL`.
+
+The TTL value must match the pattern `\d+(d\|h\|m\|s)`.
+
+Example configurations: `OnCompletionOrTTL 2d`, `TTL 2d`, `NoOp`, `OnCompletion`.
+
+|_bigNumberConfig_
+|no
+|
+a|Defines the output scale and precision of `decimal` and `varint` columns. The parameter value is a JSON string
+with the following structure:
+
+[source,json]
+----
+{
+ "column_name_1" : {"bigDecimalPrecision": 10, "bigDecimalScale": 5},
+ "column_name_2" : {"bigIntegerPrecision": 10, "bigIntegerScale": 5}
+}
+----
+
+|_lastModifiedColumnName_
+|no
+|
+|Name of the field appended to the Spark RDD that represents the last modification timestamp of each row
+
+|===
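The `[strategy] [snapshotTTL]` format and the TTL pattern described above can be validated as in the following sketch; the function is illustrative only, not the library's actual parser:

```python
import re

# Illustrative validator for the `clearSnapshotStrategy` format
# `[strategy] [snapshotTTL]`, where TTL-based strategies require a TTL
# matching \d+(d|h|m|s). A sketch, not the library's actual parsing code.
STRATEGIES = {"NoOp", "OnCompletion", "OnCompletionOrTTL", "TTL"}
TTL_PATTERN = re.compile(r"^\d+(d|h|m|s)$")

def is_valid_clear_snapshot_strategy(value: str) -> bool:
    parts = value.split()
    if not parts or parts[0] not in STRATEGIES:
        return False
    if parts[0] in ("OnCompletionOrTTL", "TTL"):
        # These strategies need exactly one TTL argument, e.g. "2d"
        return len(parts) == 2 and bool(TTL_PATTERN.match(parts[1]))
    # NoOp and OnCompletion take no TTL argument
    return len(parts) == 1
```

All four documented example configurations (`OnCompletionOrTTL 2d`, `TTL 2d`, `NoOp`, `OnCompletion`) pass this check.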
+
+=== Other Properties
+
+[cols="2,1,1,3"]
+|===
+|Property name|Required|Default|Description
+
+|_defaultParallelism_
+|recommended
+|`1`
+|Value of Spark property `spark.default.parallelism`
+
+|_numCores_
+|recommended
+|`1`
+|Total number of cores used by all Spark executors
+
+|_maxBufferSizeBytes_
+|no
+|`6291456`
+a|Maximum number of bytes per sstable file that may be downloaded and buffered in memory. This parameter is a
+global default and can be overridden per sstable file type. Effective defaults are:
+
+- `Data.db`: 6291456
+- `Index.db`: 131072
+- `Summary.db`: 262144
+- `Statistics.db`: 131072
+- `CompressionInfo.db`: 131072
+- `.log` (commit log): 65536
+- `Partitions.db`: 131072
+- `Rows.db`: 131072
+
+To override size for `Data.db`, use property `_maxBufferSizeBytes_Data.db_`.
+
+|_chunkBufferSizeBytes_
+|no
+|`4194304`
+a|Default chunk size (in bytes) requested when fetching the next portion of an sstable file. This parameter is a
+global default and can be overridden per sstable file type. Effective defaults are:
+
+- `Data.db`: 4194304
+- `Index.db`: 32768
+- `Summary.db`: 131072
+- `Statistics.db`: 65536
+- `CompressionInfo.db`: 65536
+- `.log` (commit log): 65536
+- `Partitions.db`: 4096
+- `Rows.db`: 4096
+
+To override size for `Data.db`, use property `_chunkBufferSizeBytes_Data.db_`.
+
+|_sizing_
+|no
+|`default`
+a|Determines how the number of CPU cores is selected during the read operation. Supported options:
+
+* `default`: static number of cores defined by _numCores_ parameter
+* `dynamic`: calculates the number of cores dynamically based on table size. Improves cost efficiency for processing
+small tables (a few GBs). Consult the JavaDoc of `org.apache.cassandra.spark.data.DynamicSizing` for implementation details.
+Relevant configuration properties:
+ ** _maxPartitionSize_: maximum Spark partition size (in GiB)
+
+|_quote_identifiers_
+|no
+|`false`
+|When `true`, keyspace, table and column names are quoted
+
+|_sstable_start_timestamp_micros_ and _sstable_end_timestamp_micros_
+|no
+|all sstables are selected
+|Define an inclusive time-range filter for sstable selection. Both timestamps are expressed in microseconds
+
+|===
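The override convention for the buffer and chunk size options above (a global value, optionally overridden per sstable file type such as `maxBufferSizeBytes_Data.db`) can be sketched like this; the lookup function and option-map shape are assumptions for illustration:

```python
# Sketch of resolving the effective buffer size for an sstable file type,
# assuming the documented naming convention `maxBufferSizeBytes_<FileType>`.
# The per-type built-in defaults below are taken from the table above; this
# is not the library's actual lookup code.
DEFAULTS = {
    "Data.db": 6291456,
    "Index.db": 131072,
    "Summary.db": 262144,
    "Statistics.db": 131072,
}

def effective_buffer_size(options: dict, file_type: str) -> int:
    # Most specific wins: per-type override, then the global option,
    # then the built-in default for that file type.
    return options.get(
        f"maxBufferSizeBytes_{file_type}",
        options.get("maxBufferSizeBytes", DEFAULTS[file_type]),
    )
```

Under this reading, setting only `maxBufferSizeBytes_Data.db` changes the `Data.db` buffer while `Index.db` keeps its built-in default.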
+
+== Bulk Writer
+
+This section describes configuration properties specific to the bulk writer.
+
+=== Spark Writer Configuration
+
+[cols="2,1,1,3"]
+|===
+|Property name|Required|Default|Description
+
+|_keyspace_
+|yes
+|
+|Keyspace of a table to write
+
+|_table_
+|yes
+|
+|Table to which rows are written, or from which rows are removed, depending on _write_mode_
+
+|_local_dc_
+|no
+|
+|Data center used when `LOCAL_*` consistency level is specified
+
+|_bulk_writer_cl_
+|no
+|`EACH_QUORUM`
+|Write consistency level
+
+|_write_mode_
+|no
+|`INSERT`
+a|Determines write mode:
+
+* `INSERT`: Writes new rows to the table. Generated sstables contain the data to be inserted
+* `DELETE_PARTITION`: Removes entire partitions from the table. Only partition key columns are required in the input data
+
+
+|_ttl_
+|no
+|
+|Time-to-live value (in seconds) applied to created records. When specified, all inserted rows will expire after the
+given duration. Only applicable in `INSERT` mode. Example: `86400` for a 1-day TTL
+
+|_timestamp_
+|no
+|`NOW`
+a|Mutation timestamp assigned to generated rows, expressed in microseconds. Options:
+
+* `NOW`: Uses current system time at write execution
+* Custom value: Specify exact timestamp in microseconds (e.g., `1609459200000000` for 2021-01-01 00:00:00 UTC)
+
+Custom timestamps affect conflict resolution in Cassandra (last-write-wins)
+
+|_skip_extended_verify_
+|no
+|`false`
+|Every imported sstable is verified for corruption during the import process. When set to `true`, this property skips
+the extended verification of all values in the new sstables
+
+|_quote_identifiers_
+|no
+|`false`
+|Specifies whether identifiers (i.e. keyspace, table, and column names) should be quoted to
+support mixed-case and reserved-keyword names for these fields
+
+|_data_transport_
+|no
+|`DIRECT`
+a|Specifies data transport mode. Supported implementations:
+
+* `DIRECT`: Uploads generated sstables directly to the Cassandra cluster via Sidecar
+* `S3_COMPAT`: Uploads generated sstables to a single (or multiple) remote Cassandra clusters with intermediate S3 storage
+(see <<Cloud-Storage Transport Properties>>)
+
+|===
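The microsecond example given for the `timestamp` option can be reproduced with a short conversion; the helper name is illustrative, not a library API:

```python
from datetime import datetime, timezone

# Converting a wall-clock time to the microsecond mutation timestamp expected
# by the `timestamp` option, reproducing the documented example
# (2021-01-01 00:00:00 UTC -> 1609459200000000). Illustrative helper only.
def to_mutation_timestamp_micros(dt: datetime) -> int:
    return int(dt.timestamp() * 1_000_000)

ts = to_mutation_timestamp_micros(datetime(2021, 1, 1, tzinfo=timezone.utc))
```

Because Cassandra resolves conflicts by last-write-wins on this timestamp, supplying a custom value lets replays or backfills deliberately lose to newer live writes.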
+
+=== Cloud-Storage Transport Properties
+
+Cassandra Analytics can import generated sstables into the target Cassandra cluster with intermediate S3 storage. This
+feature is especially useful when importing the same data set into multiple clusters running in remote locations.
+Writing to multiple clusters involves data replication through S3 and a two-phase coordination protocol:
+
+1. SSTables are staged on all clusters and data consistency is checked.
+2. The import proceeds and consistency is validated in each cluster.
+
+When importing to a single cluster, no coordination is needed and data is persisted in Cassandra
+as soon as it can be downloaded from S3 storage.
+
+[cols="2,1,1,3"]
+|===
+|Property name|Required|Default|Description
+
+|_coordinated_write_config_
+|no
+|
+a|
+Configuration of the coordinated write operation in JSON format. Lists all remote Cassandra clusters to write to,
+together with the list of local Sidecar instances. This property is required only when importing data to multiple
+Cassandra clusters.
+
+Example:
+
+[source,json]
+----
+{
+ "cluster1": {
+ "sidecarContactPoints": [
+ "instance-1:9999",
+ "instance-2:9999",
+ "instance-3:9999"
+ ],
+ "localDc": "dc1",
+ "writeToLocalDcOnly": false
+ },
+ "cluster2": {
+ "sidecarContactPoints": [
+ "instance-4:8888"
+ ],
+ "localDc": "dc2",
+ "writeToLocalDcOnly": false
+ }
+}
+----
+
+|_data_transport_extension_class_
+|yes
+|
+|Fully qualified name of a class that implements the `StorageTransportExtension` interface. Consult the JavaDoc for
+implementation details
+
+|_storage_client_endpoint_override_
+|no
+|
+|Overrides the S3 endpoint
+
+|_storage_client_https_proxy_
+|no
+|
+|HTTPS proxy for S3 client
+
+|_max_size_per_sstable_bundle_in_bytes_s3_transport_
+|no
+|`5368709120`
+|Limits the maximum size of uploaded S3 object
+
+|_storage_client_max_chunk_size_in_bytes_
+|no
+|`104857600`
+|Specifies maximum chunk size for multipart S3 upload
+
+|_storage_client_concurrency_
+|no
+|`CPU cores * 2`
+|Controls the max parallelism of the thread pool used by S3 client
+
+|_storage_client_thread_keep_alive_seconds_
+|no
+|60
+|Idle storage thread timeout in seconds
+
+|_storage_client_nio_http_client_connection_acquisition_timeout_seconds_
+|no
+|`300`
+|Tunes the connection acquisition timeout for the NIO HTTP component employed in the S3 client
+
+|_storage_client_nio_http_client_max_concurrency_
+|no
+|`50`
+|Specifies concurrency of the NIO HTTP component employed in S3 client
+
+|===
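The two upload-size defaults above fit together: a bundle capped at `max_size_per_sstable_bundle_in_bytes_s3_transport` (5 GiB) and multipart chunks of `storage_client_max_chunk_size_in_bytes` (100 MiB) bound the number of multipart parts per upload. A back-of-the-envelope check (arithmetic only, not library code):

```python
import math

# Defaults from the table above: a 5 GiB sstable bundle cap uploaded in
# 100 MiB multipart chunks needs at most ceil(5368709120 / 104857600) parts,
# comfortably under S3's 10,000-part multipart limit.
MAX_BUNDLE_BYTES = 5_368_709_120   # 5 GiB
CHUNK_BYTES = 104_857_600          # 100 MiB

max_parts = math.ceil(MAX_BUNDLE_BYTES / CHUNK_BYTES)
```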
+
+=== Other Properties
+
+[cols="2,1,1,3"]
+|===
+|Property name|Required|Default|Description
+
+|_number_splits_
+|no
+|`-1`
+|User-defined number of token range splits. By default, the library dynamically calculates the number of splits based
+on the Spark properties `spark.default.parallelism`, `spark.executor.cores` and `spark.executor.instances`
+
+|_sstable_data_size_in_mib_
+|no
+|`160`
+|Maximum sstable size (in MiB)
+
+|_digest_
+|no
+|`XXHash32`
+|Digest algorithm used to compute checksums when uploading sstables for validation. Supported values: `XXHash32`, `MD5`
+
+|_job_timeout_seconds_
+|no
+|`-1`
+a|Specifies a timeout in seconds for bulk write jobs. Disabled by default. When configured, a job exceeding
+the timeout is:
+
+* successful when the desired consistency level is achieved
+* failed otherwise
+
+|_job_id_
+|no
+|
+|User-defined identifier for the bulk write job
+
+|===
\ No newline at end of file
diff --git a/settings.gradle b/settings.gradle
index 5d769887..86e917c9 100644
--- a/settings.gradle
+++ b/settings.gradle
@@ -50,4 +50,5 @@ include 'cassandra-analytics-cdc-codec'
include 'analytics-sidecar-vertx-client-shaded'
include 'analytics-sidecar-vertx-client'
include 'analytics-sidecar-client'
-include 'analytics-sidecar-client-common'
\ No newline at end of file
+include 'analytics-sidecar-client-common'
+include 'docs'
\ No newline at end of file