andygrove opened a new pull request, #3813:
URL: https://github.com/apache/datafusion-comet/pull/3813
## Which issue does this PR close?
Closes #2889.
## Rationale for this change
When Spark performs the partial aggregate and Comet performs the final
aggregate for `bloom_filter_agg`, the intermediate buffer formats are
incompatible. Spark's `evaluate()` produces a serialized format with a 12-byte
big-endian header (version + num_hash_functions + num_words) followed by
big-endian bit data, while Comet's `merge_batch` expects raw bits in native
endianness with no header. This causes a panic:
```
assertion `left == right` failed: Cannot merge SparkBloomFilters with
different lengths.
left: 1048588
right: 1048576
```
The 12-byte difference is exactly the Spark serialization header size.
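The two layouts and the length arithmetic can be sketched as follows (a minimal illustration; the header field order follows Spark's `BloomFilterImpl` serialization, and the constant name is ours, not Comet's):

```rust
// Spark's serialized bloom filter buffer starts with a 12-byte
// big-endian header, followed by the bit words, also big-endian:
//
//   [version: i32 BE][num_hash_functions: i32 BE][num_words: i32 BE][words: i64 BE ...]
//
// Comet's raw format is just the native-endian bit words with no header,
// which is why the two lengths in the panic differ by exactly 12 bytes.

const SPARK_HEADER_LEN: usize = 12; // 3 * size_of::<i32>()

fn main() {
    let spark_buf_len: usize = 1_048_588; // "left" in the panic message
    let comet_bits_len: usize = 1_048_576; // "right" in the panic message
    assert_eq!(spark_buf_len - comet_bits_len, SPARK_HEADER_LEN);
    println!("header size: {}", spark_buf_len - comet_bits_len);
}
```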
## What changes are included in this PR?
Updated `SparkBloomFilter::merge_filter()` to detect and handle both formats:
- **Raw bits** (Comet partial → Comet final): merged directly as before
- **Spark serialization format** (Spark partial → Comet final): strips the
12-byte header and converts big-endian to native endianness before merging
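The detection logic described above can be sketched roughly as below. This is a hedged illustration, not Comet's actual implementation: `merge_filter_sketch` and its signature are hypothetical, and it models the bit buffer as plain bytes, distinguishing the two formats by length alone.

```rust
/// Hypothetical sketch of merging an incoming intermediate buffer into
/// an existing bit buffer, handling both Comet's raw format and Spark's
/// serialized format. Names and signature are illustrative only.
fn merge_filter_sketch(bits: &mut [u8], other: &[u8]) -> Result<(), String> {
    const SPARK_HEADER_LEN: usize = 12; // version + num_hash_functions + num_words

    let normalized: Vec<u8> = if other.len() == bits.len() {
        // Raw bits (Comet partial -> Comet final): merge directly.
        other.to_vec()
    } else if other.len() == bits.len() + SPARK_HEADER_LEN {
        // Spark serialization format (Spark partial -> Comet final):
        // strip the 12-byte header, then convert each 8-byte word from
        // big-endian to native endianness.
        other[SPARK_HEADER_LEN..]
            .chunks_exact(8)
            .flat_map(|w| u64::from_be_bytes(w.try_into().unwrap()).to_ne_bytes())
            .collect()
    } else {
        return Err(format!(
            "invalid bloom filter buffer length: expected {} or {}, got {}",
            bits.len(),
            bits.len() + SPARK_HEADER_LEN,
            other.len()
        ));
    };

    // Bloom filter union: OR the bit buffers together.
    for (b, o) in bits.iter_mut().zip(normalized) {
        *b |= o;
    }
    Ok(())
}

fn main() {
    // Simulate a Spark-serialized buffer: 12 header bytes + two BE words.
    let mut bits = vec![0u8; 16];
    let mut spark = vec![0u8; 12];
    spark.extend_from_slice(&1u64.to_be_bytes());
    spark.extend_from_slice(&2u64.to_be_bytes());
    merge_filter_sketch(&mut bits, &spark).unwrap();
    println!("{}", u64::from_ne_bytes(bits[0..8].try_into().unwrap()));
}
```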
## How are these changes tested?
Added 6 unit tests covering:
- Comet-to-Comet merge via raw bits (`test_merge_comet_state_format`)
- Spark-to-Comet merge via Spark serialization format
(`test_merge_spark_serialization_format`) — reproduces the #2889 scenario
- End-to-end `Accumulator::merge_batch` with both formats
- Spark serialization roundtrip
- Invalid buffer length error case
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]