[
https://issues.apache.org/jira/browse/CASSANALYTICS-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084478#comment-18084478
]
Jon Haddad edited comment on CASSANALYTICS-167 at 5/30/26 12:22 AM:
--------------------------------------------------------------------
+1
I had a couple of minor nits in the code, but no blockers from me. I know Yifan
had at least one or two comments, so please check with him before merging to
see if he has any requests.
I tested this today against a single-node Cassandra 5.0 cluster with
storage_compatibility_mode: NONE, using the DirectBulkWriter Spark job to
bulk-load 10 million rows via the sidecar transport. The first run used the
existing cassandra-analytics library (0.4.0, Spark 3 / Scala 2.12) without the
fix. The resulting SSTables had Filter.db files of only 16 bytes regardless of
SSTable size (oa-14 through oa-16, with Data.db sizes from 53 to 97 MB),
confirming that correct Bloom filters were not being generated.
A second run was performed with Lukasz's branch, building cassandra-analytics
against Spark 4 / Scala 2.13 and running on EMR emr-spark-8.0.0. The new
SSTables (oa-18 through oa-21) showed Filter.db files of 2.2 to 4.5 MB,
proportional to their Data.db sizes (48 to 97 MB), confirming that
rebuildFilterComponents is correctly regenerating Bloom filters after
CQLSSTableWriter completes.
||SSTable||Run||Filter.db||Data.db||
|oa-14|Before (Scala 2.12, no fix)|16 bytes|53 MB|
|oa-15|Before (Scala 2.12, no fix)|16 bytes|63 MB|
|oa-16|Before (Scala 2.12, no fix)|16 bytes|97 MB|
|oa-17|After (Spark 4 / Scala 2.13, fix applied)|123 KB|2.7 MB|
|oa-18|After (Spark 4 / Scala 2.13, fix applied)|2.2 MB|48 MB|
|oa-19|After (Spark 4 / Scala 2.13, fix applied)|2.4 MB|53 MB|
|oa-20|After (Spark 4 / Scala 2.13, fix applied)|2.9 MB|63 MB|
|oa-21|After (Spark 4 / Scala 2.13, fix applied)|4.5 MB|97 MB|
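For anyone who wants to repeat the check on their own output directory, here is a minimal sketch that flags SSTables whose Filter.db looks empty. It assumes component files end in -Data.db and -Filter.db and sit side by side in one directory. The function name and the 17-byte cutoff are my own choices for illustration, taken from the 16-byte files observed above, and are not part of cassandra-analytics.

```python
from pathlib import Path


def find_suspect_filters(sstable_dir, min_filter_bytes=17):
    """Return (sstable_prefix, filter_size, data_size) for each SSTable whose
    Filter.db is smaller than min_filter_bytes while its Data.db is non-empty.

    The cutoff is an assumption based on the 16-byte Filter.db files seen in
    the "before" run; adjust it if your empty filters have a different size.
    """
    suspects = []
    for data_file in sorted(Path(sstable_dir).glob("*-Data.db")):
        # Everything before the component name, e.g. "oa-14-big-"
        prefix = data_file.name[: -len("Data.db")]
        filter_file = data_file.with_name(prefix + "Filter.db")
        if not filter_file.exists():
            continue  # nothing to compare against
        filter_size = filter_file.stat().st_size
        data_size = data_file.stat().st_size
        if filter_size < min_filter_bytes and data_size > 0:
            suspects.append((prefix.rstrip("-"), filter_size, data_size))
    return suspects
```

An empty result for the directory holding the "after" SSTables would match what the table shows. A non-empty result lists the SSTables that still carry a placeholder filter.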
> Regenerate Bloom filters for CQLSSTableWriter-produced SSTables before upload
> -----------------------------------------------------------------------------
>
> Key: CASSANALYTICS-167
> URL: https://issues.apache.org/jira/browse/CASSANALYTICS-167
> Project: Apache Cassandra Analytics
> Issue Type: Improvement
> Components: Writer
> Reporter: Yifan Cai
> Assignee: Lukasz Antoniak
> Priority: Normal
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> CQLSSTableWriter produces empty Filter.db files when flushing SSTables. This
> causes Cassandra nodes to skip Bloom filter checks on imported SSTables,
> resulting in unnecessary disk reads for every partition key lookup.
> Fixing CQLSSTableWriter upstream requires a new Cassandra release. As a
> near-term fix, cassandra-analytics will regenerate correct Bloom filters from
> the SSTable's Index.db before uploading.
> Proposed changes:
>
> - Add rebuildBloomFilter method to CassandraBridge interface with
>   implementations in FourZeroBridge and FiveZeroBridge
> - Call the rebuild in SortedSSTableWriter.close() after the SSTable flush and
>   before digest computation, so digests cover the correct filter
> Jon Haddad reported the issue.
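
The ordering in the second proposed change (rebuild first, digest second) can be illustrated with a small toy model. This is only a sketch under assumptions: components are modeled as an in-memory dict of bytes, finalize_components and rebuild_filter are hypothetical names, and SHA-256 stands in for whatever digest the writer actually computes. None of this is the cassandra-analytics API.

```python
import hashlib


def finalize_components(components, rebuild_filter):
    """Rebuild Filter.db from Index.db, then digest every component.

    components: dict mapping a component name to its bytes (toy model).
    rebuild_filter: hypothetical callable that takes the Index.db bytes and
        returns the regenerated Filter.db bytes.
    Returns (finalized_components, digests).
    """
    finalized = dict(components)  # leave the caller's dict untouched
    # Step 1: replace the placeholder filter before any digest is taken.
    finalized["Filter.db"] = rebuild_filter(finalized["Index.db"])
    # Step 2: digest the final bytes, so the Filter.db digest matches the
    # file that will be uploaded rather than the placeholder.
    digests = {
        name: hashlib.sha256(data).hexdigest()  # SHA-256 is a stand-in
        for name, data in finalized.items()
    }
    return finalized, digests
```

If the two steps were swapped, the digest recorded for Filter.db would describe the placeholder file and would no longer match the uploaded bytes.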