[ 
https://issues.apache.org/jira/browse/CASSANALYTICS-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084478#comment-18084478
 ] 

Jon Haddad edited comment on CASSANALYTICS-167 at 5/30/26 12:22 AM:
--------------------------------------------------------------------

+1

I had a couple of minor nits in the code, but no blockers from me. I know Yifan 
had at least one or two comments, so please check with him before merging to 
see if he has any requests.

I tested this today against a single-node Cassandra 5.0 cluster with 
storage_compatibility_mode: NONE using the DirectBulkWriter Spark job to 
bulk-load 10 million rows via the sidecar transport. The first run used the 
existing cassandra-analytics library (0.4.0, Spark 3 / Scala 2.12) without the 
fix. The resulting SSTables had Filter.db files of only 16 bytes regardless of 
SSTable size (oa-14 through oa-16, Data.db sizes ranging from 53 to 97 MB), 
confirming that correct Bloom filters were not being generated.

For the second run I used Lukasz's branch, building cassandra-analytics 
against Spark 4 / Scala 2.13 and running on EMR emr-spark-8.0.0. The new 
SSTables (oa-18 through oa-21) showed Filter.db files of 2.2–4.5 MB, roughly 
proportional to their Data.db sizes (48–97 MB); the much smaller oa-17 
(2.7 MB of data) had a 123 KB filter. This confirms that 
rebuildFilterComponents is correctly regenerating Bloom filters after 
CQLSSTableWriter completes.

 
||SSTable||Run||Filter.db size||Data.db size||
|oa-14|Before (Spark 3 / Scala 2.12, no fix)|16 bytes|53 MB|
|oa-15|Before (Spark 3 / Scala 2.12, no fix)|16 bytes|63 MB|
|oa-16|Before (Spark 3 / Scala 2.12, no fix)|16 bytes|97 MB|
|oa-17|After (Spark 4 / Scala 2.13, fix applied)|123 KB|2.7 MB|
|oa-18|After (Spark 4 / Scala 2.13, fix applied)|2.2 MB|48 MB|
|oa-19|After (Spark 4 / Scala 2.13, fix applied)|2.4 MB|53 MB|
|oa-20|After (Spark 4 / Scala 2.13, fix applied)|2.9 MB|63 MB|
|oa-21|After (Spark 4 / Scala 2.13, fix applied)|4.5 MB|97 MB|
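For anyone who wants to repeat the size check from the table, here is a minimal sketch that lists Filter.db and Data.db sizes per SSTable in a directory and flags filters that look empty. It is not part of the patch. The `filter_report` helper and the `SUSPECT_FILTER_BYTES` threshold are hypothetical names; the threshold is taken from the 16-byte files seen in the first run, and the script assumes component files share a prefix and end in `-Data.db` / `-Filter.db`.

```python
from pathlib import Path

# Assumption: a Filter.db at or below this size is treated as effectively empty.
# 16 bytes is the size observed in the "before" run; adjust as needed.
SUSPECT_FILTER_BYTES = 16


def filter_report(sstable_dir):
    """Return one (name, filter_bytes, data_bytes, ratio, suspect) tuple per SSTable."""
    rows = []
    for data_file in sorted(Path(sstable_dir).glob("*-Data.db")):
        # Everything before "Data.db" is the shared component prefix.
        prefix = data_file.name[: -len("Data.db")]
        filter_file = data_file.with_name(prefix + "Filter.db")
        data_bytes = data_file.stat().st_size
        # A missing Filter.db is reported as 0 bytes and therefore flagged.
        filter_bytes = filter_file.stat().st_size if filter_file.exists() else 0
        ratio = filter_bytes / data_bytes if data_bytes else 0.0
        suspect = filter_bytes <= SUSPECT_FILTER_BYTES
        rows.append((prefix.rstrip("-"), filter_bytes, data_bytes, ratio, suspect))
    return rows
```

Calling `filter_report` on a table's data directory and printing the rows gives the same columns as the table above, plus a ratio. In the table, the SSTables written with the fix have a Filter.db of roughly 4 to 5 percent of the Data.db size, while the 16-byte filters do not scale at all.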


> Regenerate Bloom filters for CQLSSTableWriter-produced SSTables before upload
> -----------------------------------------------------------------------------
>
>                 Key: CASSANALYTICS-167
>                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-167
>             Project: Apache Cassandra Analytics
>          Issue Type: Improvement
>          Components: Writer
>            Reporter: Yifan Cai
>            Assignee: Lukasz Antoniak
>            Priority: Normal
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> CQLSSTableWriter produces empty Filter.db files when flushing SSTables. This 
> causes Cassandra nodes to skip Bloom filter checks on imported SSTables, 
> resulting in unnecessary disk reads for every partition key lookup.
> Fixing CQLSSTableWriter upstream requires a new Cassandra release. As a 
> near-term fix, cassandra-analytics will regenerate correct Bloom filters from 
> the SSTable's Index.db before uploading.
> Proposed changes:
> - Add rebuildBloomFilter method to CassandraBridge interface with 
> implementations in FourZeroBridge and FiveZeroBridge
> - Call the rebuild in SortedSSTableWriter.close() after the SSTable flush and 
> before digest computation, so digests cover the correct filter
>
> Jon Haddad reported the issue.
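To illustrate the ordering described in the proposed changes (rebuild the filter, then compute digests), here is a toy sketch. It is not the cassandra-analytics code and does not use Cassandra's hash functions or Filter.db format. `ToyBloomFilter`, `rebuild_filter`, and `finalize` are hypothetical names, partition keys are passed in directly rather than read from Index.db, SHA-256 stands in for whatever digest the writer uses, and the 10 bits per key and 7 hash functions are illustrative values only.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative Bloom filter; not Cassandra's hashing or on-disk layout."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def rebuild_filter(partition_keys, bits_per_key=10, num_hashes=7):
    """Build a filter sized to the number of keys (hypothetical helper)."""
    keys = list(partition_keys)
    bloom = ToyBloomFilter(max(8, len(keys) * bits_per_key), num_hashes)
    for key in keys:
        bloom.add(key)
    return bloom


def finalize(components, partition_keys):
    """Replace the filter first, then digest every component.

    components maps a component name to its bytes. Returns (digests, components).
    """
    components = dict(components)
    components["Filter.db"] = bytes(rebuild_filter(partition_keys).bits)
    # Digests are computed only after the filter has been replaced,
    # so the Filter.db digest matches the bytes that would be uploaded.
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in components.items()}
    return digests, components
```

If the digests were computed before the rebuild, the Filter.db digest would describe the empty placeholder rather than the regenerated filter, which is the mismatch the ordering in the description avoids.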


