adriangb opened a new pull request, #9628: URL: https://github.com/apache/arrow-rs/pull/9628
## Summary

- Bloom filters now support **folding mode**: allocate a conservatively large filter (sized for worst-case NDV = max row group rows), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront.
- When `ndv` is not explicitly set (the new default), folding mode activates automatically. Setting `ndv` explicitly via `set_bloom_filter_ndv()` preserves the existing fixed-size behavior.
- `BloomFilterProperties.ndv` changed from `u64` to `Option<u64>` (`None` = folding mode)
- Added `BloomFilterProperties.max_bytes` and `set_bloom_filter_max_bytes()` for explicit initial size control
- Default FPP changed from `0.05` to `0.01`
- Default initial filter size is derived from `max_row_group_row_count` + `fpp` (1 MiB for defaults)

### How folding works

The SBBF fold operation merges adjacent block pairs (`block[2i] | block[2i+1]`) via bitwise OR, halving the filter size. This is correct because `hash_to_block_index` maps to `floor(original_index / 2)` when `num_blocks` is halved.

`fold_to_target_fpp()` repeats this until the next fold would exceed the target FPP, estimated per-block as `avg(block_fill^8)`.

### Breaking changes

- `BloomFilterProperties.ndv`: `u64` → `Option<u64>` (direct struct construction must be updated)
- `DEFAULT_BLOOM_FILTER_FPP`: `0.05` → `0.01`
- `DEFAULT_BLOOM_FILTER_NDV`: deprecated

## Test plan

- [x] All existing bloom filter unit tests pass
- [x] All existing integration tests (sync + async reader roundtrips) pass
- [x] New unit tests: fold correctness, no false negatives after folding, FPP target respected, minimum size guard
- [x] Full `cargo test -p parquet` passes (1165 tests, 0 failures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
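
For readers who want to see the fold from "How folding works" in isolation, here is a minimal standalone sketch. It is not the code in this PR: the function names (`block_index`, `fold`, `estimated_fpp`, `fold_until_target`) are hypothetical, and the block layout (8 × 32-bit words), the multiply-shift block index, and the salt constants are assumptions modelled on the Parquet split-block Bloom filter format. The fold argument depends only on the multiply-shift index, not on the exact salt values.

```rust
/// One split-block Bloom filter block: 8 words of 32 bits (256 bits).
type Block = [u32; 8];

/// Salt constants (assumed to match the Parquet SBBF format; the fold
/// logic below does not depend on their exact values).
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Multiply-shift mapping of the upper 32 hash bits onto `num_blocks`.
/// For even `num_blocks`, block_index(h, n / 2) == block_index(h, n) / 2.
fn block_index(hash: u64, num_blocks: usize) -> usize {
    (((hash >> 32) * num_blocks as u64) >> 32) as usize
}

/// One bit per word, derived from the lower 32 hash bits.
fn mask(hash: u64) -> Block {
    let x = hash as u32;
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

fn insert(blocks: &mut [Block], hash: u64) {
    let idx = block_index(hash, blocks.len());
    for (word, bit) in blocks[idx].iter_mut().zip(mask(hash).iter()) {
        *word |= *bit;
    }
}

fn check(blocks: &[Block], hash: u64) -> bool {
    let idx = block_index(hash, blocks.len());
    blocks[idx]
        .iter()
        .zip(mask(hash).iter())
        .all(|(word, bit)| (*word & *bit) != 0)
}

/// Halve the filter by OR-ing each adjacent pair of blocks.
fn fold(blocks: &[Block]) -> Vec<Block> {
    assert!(blocks.len() % 2 == 0, "block count must be even to fold");
    blocks
        .chunks_exact(2)
        .map(|pair| {
            let mut out = [0u32; 8];
            for i in 0..8 {
                out[i] = pair[0][i] | pair[1][i];
            }
            out
        })
        .collect()
}

/// Average over blocks of (fraction of set bits)^8.
fn estimated_fpp(blocks: &[Block]) -> f64 {
    let total: f64 = blocks
        .iter()
        .map(|b| {
            let set: u32 = b.iter().map(|w| w.count_ones()).sum();
            (set as f64 / 256.0).powi(8)
        })
        .sum();
    total / blocks.len() as f64
}

/// Keep folding while the folded filter's estimate stays within the target.
fn fold_until_target(mut blocks: Vec<Block>, target_fpp: f64) -> Vec<Block> {
    while blocks.len() >= 2 && blocks.len() % 2 == 0 {
        let candidate = fold(&blocks);
        if estimated_fpp(&candidate) > target_fpp {
            break;
        }
        blocks = candidate;
    }
    blocks
}
```

In this sketch, a value inserted into block `i` of the large filter is looked up in block `i / 2` of the folded one, which holds the OR of blocks `2 * (i / 2)` and `2 * (i / 2) + 1`, so every bit it set is still present and folding cannot introduce false negatives. The sketch builds a candidate filter before committing to each fold, which is simple but allocates; a real implementation could estimate the post-fold fill without materializing it.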
