liucao-dd opened a new pull request, #206: URL: https://github.com/apache/cassandra-analytics/pull/206
## Patch information

Jira: https://issues.apache.org/jira/browse/CASSANALYTICS-42

## Summary

Introduces an S3-backed Cassandra batch reader so Spark jobs can read SSTables directly from object-storage backups without going through a live cluster. Built on top of the existing `CassandraDataLayer` / bulk-reader abstractions and wired into Spark SQL via a new `S3CassandraDataSource`.

Highlights:

- **Backup reader foundations** (`cassandra-analytics-core/.../spark/data/backup/`): `BackupReader`, `BackupReaderConfig`, and a factory abstraction so different backup providers can plug in.
- **`S3CassandraDataLayer`**: token-aware partitioning and SSTable selection over S3-resident backups, including `SSTableTokenBounds` pruning that correctly handles the Murmur3 wrap-around.
- **Spark SQL integration**: `S3CassandraDataSource`, `S3CassandraPrebuiltReadContext(Registry)`, `S3CassandraTokenIndexPrebuilder`, plus refinements to `CassandraScanBuilder`, `CassandraPartitioning`, `CassandraTable`, and statistics reporting.
- **Per-task metrics**: a `SparkCustomMetricsStats` implementation and Spark `CustomTaskMetric` classes for S3 GET/HEAD latency, summary read latency, skipped/corrupt SSTable counts, mutable metadata drift, etc., threaded through `BackupReader` read paths via a per-task `Stats` argument.
- **VersionRunner** is now published via `java-test-fixtures` so downstream modules can reuse it.

## Implementation notes

- New module: `cassandra-analytics-instaclustr/` contains an Instaclustr-specific `InstaclustrS3BackupReader` implementation built on the new `BackupReader` SPI.
- No changes to existing public bridge APIs beyond extending `CassandraBridgeImplementation` and `SummaryDbUtils` to thread the additional `Stats` parameter; previous call sites have backwards-compatible overloads where applicable.
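
For readers unfamiliar with the wrap-around case mentioned under `SSTableTokenBounds` pruning, the following is a minimal sketch of the general idea, not the code in this PR. It assumes an SSTable is summarized by its inclusive first and last Murmur3 tokens, and that a query range follows Cassandra's `(start, end]` convention, where `start >= end` means the range wraps past `Long.MAX_VALUE` back to `Long.MIN_VALUE`. The class and method names (`TokenRangePruningSketch`, `mayOverlap`) are hypothetical.

```java
// Hypothetical sketch: names are illustrative and are not taken from the PR.
final class TokenRangePruningSketch {

    /**
     * Returns true if an SSTable whose tokens span [first, last] (inclusive, first <= last)
     * may hold data for the token range (start, end]. A range with start >= end is treated
     * as wrapping past Long.MAX_VALUE back to Long.MIN_VALUE; start == end is the full ring.
     */
    static boolean mayOverlap(long first, long last, long start, long end) {
        if (first > last) {
            throw new IllegalArgumentException("SSTable bounds must satisfy first <= last");
        }
        if (start < end) {
            // Ordinary range: some token t must satisfy start < t <= end and first <= t <= last.
            return last > start && first <= end;
        }
        // Wrapping range: (start, MAX] union [MIN, end]. Either half overlapping is enough.
        return last > start || first <= end;
    }

    public static void main(String[] args) {
        // Ordinary range (150, 300] against an SSTable covering [100, 200].
        System.out.println(mayOverlap(100, 200, 150, 300));
        // Wrapping range (9000, -9000] against an SSTable covering [0, 10].
        System.out.println(mayOverlap(0, 10, 9000, -9000));
        // Same wrapping range against an SSTable near the low end of the ring.
        System.out.println(mayOverlap(-9500, -9100, 9000, -9000));
    }
}
```

A pruning step built this way can only skip an SSTable when `mayOverlap` returns false, so treating the wrapping range as a single interval (and comparing `start < end` unconditionally) would wrongly drop SSTables that sit at either end of the ring.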
## Test plan

- New unit tests:
  - `S3ClientCacheTest`, `S3ClientConfigTest`, `S3DataSourceClientConfig*Test`
  - `S3SSTableLeakTests`, `SSTableTokenIndexTest`
  - `BackupReaderFactorySerializationTest`, `S3CassandraPrebuiltReadContextTest`
- Updated `ReaderUtilsTests`, `SSTableReaderTests`, `SummaryDbTests` for the new `Stats`-threaded signatures.
- End-to-end coverage in `S3CassandraDataLayerTests` (Instaclustr module) exercises read paths against fixture SSTables.

## Reviewer notes

The diff is large (~9.4k LOC added across ~80 files) because the change introduces both the SPI and a full provider implementation plus metrics. Happy to split into smaller PRs (e.g. backup SPI, Spark SQL plumbing, Instaclustr provider, metrics) if reviewers prefer; let me know.

Made with [Cursor](https://cursor.com)
