andygrove opened a new pull request, #3813:
URL: https://github.com/apache/datafusion-comet/pull/3813
## Which issue does this PR close?
Closes #2889.
## Rationale for this change
When Spark performs the partial aggregate and Comet performs the final
aggregate for `bloom_filter_agg`, the intermediate buffer formats are
incompatible. Spark's `evaluate()` produces a serialized format with a 12-byte
big-endian header (version + num_hash_functions + num_words) followed by
big-endian bit data, while Comet's `merge_batch` expects raw bits in native
endianness with no header. This causes a panic:
```
assertion `left == right` failed: Cannot merge SparkBloomFilters with
different lengths.
left: 1048588
right: 1048576
```
The 12-byte difference is exactly the Spark serialization header size.
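The two layouts and the length arithmetic can be sketched as follows (a minimal illustration; the header field order follows Spark's `BloomFilterImpl` serialization, and the constant name is ours, not Comet's):

```rust
// Spark's serialized bloom filter buffer starts with a 12-byte
// big-endian header, followed by the bit words, also big-endian:
//
//   [version: i32 BE][num_hash_functions: i32 BE][num_words: i32 BE][words: i64 BE ...]
//
// Comet's raw format is just the native-endian bit words with no header,
// which is why the two lengths in the panic differ by exactly 12 bytes.

const SPARK_HEADER_LEN: usize = 12; // 3 * size_of::<i32>()

fn main() {
    let spark_buf_len: usize = 1_048_588; // "left" in the panic message
    let comet_bits_len: usize = 1_048_576; // "right" in the panic message
    assert_eq!(spark_buf_len - comet_bits_len, SPARK_HEADER_LEN);
    println!("header size: {}", spark_buf_len - comet_bits_len);
}
```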
## What changes are included in this PR?
Updated `SparkBloomFilter::merge_filter()` to detect and handle both formats:
- **Raw bits** (Comet partial → Comet final): merged directly as before
- **Spark serialization format** (Spark partial → Comet final): strips the
12-byte header and converts big-endian to native endianness before merging
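The detection logic described above can be sketched roughly as below. This is a hedged illustration, not Comet's actual implementation: `merge_filter_sketch` and its signature are hypothetical, and it models the bit buffer as plain bytes, distinguishing the two formats by length alone.

```rust
/// Hypothetical sketch of merging an incoming intermediate buffer into
/// an existing bit buffer, handling both Comet's raw format and Spark's
/// serialized format. Names and signature are illustrative only.
fn merge_filter_sketch(bits: &mut [u8], other: &[u8]) -> Result<(), String> {
    const SPARK_HEADER_LEN: usize = 12; // version + num_hash_functions + num_words

    let normalized: Vec<u8> = if other.len() == bits.len() {
        // Raw bits (Comet partial -> Comet final): merge directly.
        other.to_vec()
    } else if other.len() == bits.len() + SPARK_HEADER_LEN {
        // Spark serialization format (Spark partial -> Comet final):
        // strip the 12-byte header, then convert each 8-byte word from
        // big-endian to native endianness.
        other[SPARK_HEADER_LEN..]
            .chunks_exact(8)
            .flat_map(|w| u64::from_be_bytes(w.try_into().unwrap()).to_ne_bytes())
            .collect()
    } else {
        return Err(format!(
            "invalid bloom filter buffer length: expected {} or {}, got {}",
            bits.len(),
            bits.len() + SPARK_HEADER_LEN,
            other.len()
        ));
    };

    // Bloom filter union: OR the bit buffers together.
    for (b, o) in bits.iter_mut().zip(normalized) {
        *b |= o;
    }
    Ok(())
}

fn main() {
    // Simulate a Spark-serialized buffer: 12 header bytes + two BE words.
    let mut bits = vec![0u8; 16];
    let mut spark = vec![0u8; 12];
    spark.extend_from_slice(&1u64.to_be_bytes());
    spark.extend_from_slice(&2u64.to_be_bytes());
    merge_filter_sketch(&mut bits, &spark).unwrap();
    println!("{}", u64::from_ne_bytes(bits[0..8].try_into().unwrap()));
}
```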
## How are these changes tested?
Added 6 unit tests covering:
- Comet-to-Comet merge via raw bits (`test_merge_comet_state_format`)
- Spark-to-Comet merge via Spark serialization format
(`test_merge_spark_serialization_format`) — reproduces the #2889 scenario
- End-to-end `Accumulator::merge_batch` with both formats
- Spark serialization roundtrip
- Invalid buffer length error case
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]