[PR] CASSANALYTICS-42 Add S3-backed Cassandra batch reader [cassandra-analytics]

via GitHub Tue, 12 May 2026 15:10:35 -0700


liucao-dd opened a new pull request, #206:
URL: https://github.com/apache/cassandra-analytics/pull/206


   ## Patch information
   
   Jira: https://issues.apache.org/jira/browse/CASSANALYTICS-42
   
   ## Summary
   
   Introduces an S3-backed Cassandra batch reader so Spark jobs can read 
SSTables directly from object-storage backups without going through a live 
cluster. Built on top of the existing `CassandraDataLayer` / bulk-reader 
abstractions and wired into Spark SQL via a new `S3CassandraDataSource`.
   
   Highlights:
   
   - **Backup reader foundations** 
(`cassandra-analytics-core/.../spark/data/backup/`): `BackupReader`, 
`BackupReaderConfig`, and a factory abstraction so different backup providers 
can plug in.
   - **`S3CassandraDataLayer`**: token-aware partitioning and SSTable selection 
over S3-resident backups, including `SSTableTokenBounds` pruning that correctly 
handles the Murmur3 wrap-around.
   - **Spark SQL integration**: `S3CassandraDataSource`, 
`S3CassandraPrebuiltReadContext(Registry)`, `S3CassandraTokenIndexPrebuilder`, 
plus refinements to `CassandraScanBuilder`, `CassandraPartitioning`, 
`CassandraTable`, and statistics reporting.
   - **Per-task metrics**: a `SparkCustomMetricsStats` implementation and Spark 
`CustomTaskMetric` classes for S3 GET/HEAD latency, summary read latency, 
skipped/corrupt SSTable counts, mutable metadata drift, etc., threaded through 
`BackupReader` read paths via a per-task `Stats` argument.
   - **VersionRunner** is now published via `java-test-fixtures` so downstream 
modules can reuse it.
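   
   For reviewers less familiar with the wrap-around case, here is a minimal sketch of the kind of overlap check that token-bounds pruning needs. It assumes Cassandra's `(start, end]` range convention, where `start >= end` denotes a range that wraps past the maximum token, and closed `[first, last]` SSTable bounds. The class and method names are hypothetical and are not the `SSTableTokenBounds` code in this patch.

```java
// Hypothetical illustration only; not the implementation in this PR.
final class TokenRangePruningSketch {

    private TokenRangePruningSketch() {}

    /**
     * Returns true if the token range (start, end] may contain a token that
     * falls inside the SSTable's closed bounds [first, last].
     *
     * Assumptions: tokens are signed 64-bit Murmur3 values, first <= last,
     * and start >= end means the range wraps around the ring
     * (start == end is treated as the full ring).
     */
    static boolean overlaps(long start, long end, long first, long last) {
        if (start < end) {
            // Ordinary range: start is exclusive, end is inclusive.
            return start < last && first <= end;
        }
        // Wrapping range: (start, MAX] union [MIN, end].
        return last > start || first <= end;
    }
}
```

   An SSTable whose bounds fail this check for a partition's range can be skipped without opening its data file, which is the point of pruning.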
   
   ## Implementation notes
   
   - New module: `cassandra-analytics-instaclustr/` contains an 
Instaclustr-specific `InstaclustrS3BackupReader` implementation built on the 
new `BackupReader` SPI.
   - No changes to existing public bridge APIs beyond extending 
`CassandraBridgeImplementation` and `SummaryDbUtils` to thread the additional 
`Stats` parameter; existing call sites keep working through 
backwards-compatible overloads where applicable.
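   
   As a rough illustration of the overload pattern described above, the sketch below keeps an old signature by delegating to a new one with a no-op stats instance. `ReadStatsSketch`, `CountingStatsSketch`, and `SummaryReaderSketch` are hypothetical stand-ins, not the actual `Stats` or `SummaryDbUtils` signatures in this patch.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a per-task stats callback; default methods do nothing.
interface ReadStatsSketch {
    ReadStatsSketch NOOP = new ReadStatsSketch() {};

    default void summaryRead(long elapsedNanos) {}
}

// Example stats implementation that counts how many summary reads it saw.
final class CountingStatsSketch implements ReadStatsSketch {
    private final AtomicLong reads = new AtomicLong();

    @Override
    public void summaryRead(long elapsedNanos) {
        reads.incrementAndGet();
    }

    long reads() {
        return reads.get();
    }
}

final class SummaryReaderSketch {
    private SummaryReaderSketch() {}

    // Old signature: preserved so existing callers compile unchanged.
    static int readSummary(byte[] data) {
        return readSummary(data, ReadStatsSketch.NOOP);
    }

    // New signature: the caller supplies the per-task stats instance.
    static int readSummary(byte[] data, ReadStatsSketch stats) {
        long startNanos = System.nanoTime();
        int length = data.length; // placeholder for the real parsing work
        stats.summaryRead(System.nanoTime() - startNanos);
        return length;
    }
}
```

   Callers that do not care about metrics keep using the single-argument form, while Spark task code can pass its own stats object.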
   
   ## Test plan
   
   - New unit tests:
     - `S3ClientCacheTest`, `S3ClientConfigTest`, 
`S3DataSourceClientConfig*Test`
     - `S3SSTableLeakTests`, `SSTableTokenIndexTest`
     - `BackupReaderFactorySerializationTest`, 
`S3CassandraPrebuiltReadContextTest`
     - Updated `ReaderUtilsTests`, `SSTableReaderTests`, `SummaryDbTests` for 
the new `Stats`-threaded signatures.
   - End-to-end coverage in `S3CassandraDataLayerTests` (Instaclustr module) 
exercises read paths against fixture SSTables.
   
   ## Reviewer notes
   
   The diff is large (~9.4k LOC added across ~80 files) because the change 
introduces the SPI, a full provider implementation, and metrics. Happy to 
split it into smaller PRs (e.g. backup SPI, Spark SQL plumbing, Instaclustr 
provider, metrics) if reviewers prefer; let me know.
   
   Made with [Cursor](https://cursor.com)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

