emkornfield commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3024581640
##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -431,6 +485,150 @@ impl Sbbf {
self.0.capacity() * std::mem::size_of::<Block>()
}
+ /// Returns the number of blocks in this bloom filter.
+ pub fn num_blocks(&self) -> usize {
+ self.0.len()
+ }
+
+ /// Fold the bloom filter once, halving its size by merging adjacent block pairs.
+ ///
+ /// This implements an elementary folding operation for Split Block Bloom
+ /// Filters. Each pair of adjacent blocks is combined via bitwise OR:
+ ///
+ /// ```text
+ /// folded[i] = blocks[2*i] | blocks[2*i + 1] for 0 <= i < num_blocks/2
+ /// ```
+ ///
+ /// ## Why adjacent pairs (not halves)?
+ ///
+ /// SBBFs use **multiplicative** hashing for block selection:
+ ///
+ /// ```text
+ /// block_index = ((hash >> 32) * num_blocks) >> 32
+ /// ```
+ ///
+ /// When `num_blocks` is halved, the new index becomes `floor(original_index / 2)`.
+ /// Therefore blocks `2i` and `2i+1` map to the same position `i` in the folded filter.
+ ///
+ /// This differs from standard Bloom filter folding, which merges the two halves
+ /// (`B[i] | B[i + m/2]`), because standard filters use modular hashing, where
+ /// `h(x) mod (m/2)` maps indices `i` and `i + m/2` to the same position.
+ ///
+ /// ## References
+ ///
+ /// 1. Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+ /// IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+ ///
+ /// # Panics
+ ///
+ /// Panics if the filter has fewer than 2 blocks.
+ fn fold_once(&mut self) {
+ let len = self.0.len();
+ assert!(
+ len >= 2,
+ "Cannot fold a bloom filter with fewer than 2 blocks"
+ );
+ assert!(
+ len % 2 == 0,
+ "Cannot fold a bloom filter with an odd number of blocks"
+ );
+ let half = len / 2;
+ for i in 0..half {
+ // Copy both blocks to stack locals before writing. This eliminates
+ // aliasing between self.0[i] and self.0[2*i], allowing the compiler
+ // to vectorize the OR into SIMD instructions.
+ let merged = self.0[2 * i] | self.0[2 * i + 1];
+ self.0[i] = merged;
+ }
+ self.0.truncate(half);
+ }
+
+ /// Estimate the FPP that would result from folding once, without mutating the filter.
+ ///
+ /// SBBF membership checks are per-block (`k=8` bit checks within one 256-bit block),
+ /// so FPP is the average per-block false positive probability:
+ ///
+ /// ```text
+ /// FPP = (1 / num_blocks) * sum_i (set_bits_in_block_i / 256)^8
+ /// ```
+ ///
+ /// Separating the OR from the popcount lets the compiler emit vectorized popcount
+ /// instructions (e.g., `cnt.16b` on ARM NEON) instead of per-word scalar popcount.
+ fn estimated_fpp_after_fold(&self) -> f64 {
+ let half = self.0.len() as f64 / 2.0;
+ let mut total_fpp: f64 = 0.0;
+ for pair in self.0.chunks_exact(2) {
+ let merged = pair[0] | pair[1];
+ let set_bits = merged.count_ones();
+ let block_fill = f64::from(set_bits) / 256.0;
+ total_fpp += block_fill.powi(8);
+ }
+ total_fpp / half
+ }
+
+ /// Fold the bloom filter down to the smallest size that still meets the target FPP
+ /// (false positive probability).
+ ///
+ /// Repeatedly halves the filter by merging adjacent block pairs via bitwise OR,
+ /// stopping when the next fold would cause the estimated FPP to exceed `target_fpp`,
+ /// or when the filter reaches the minimum size of 1 block (32 bytes).
+ ///
+ /// ## How it works
+ ///
+ /// SBBFs use multiplicative hashing for block selection:
+ ///
+ /// ```text
+ /// block_index = ((hash >> 32) * num_blocks) >> 32
+ /// ```
+ ///
+ /// When `num_blocks` is halved, the new index becomes `floor(original_index / 2)`,
+ /// so blocks `2i` and `2i+1` map to the same position. Each fold merges **adjacent** pairs:
+ ///
+ /// ```text
+ /// folded[i] = blocks[2*i] | blocks[2*i + 1]
+ /// ```
+ ///
+ /// The FPP after a fold is estimated as the average per-block false positive probability:
+ ///
+ /// ```text
+ /// FPP = (1/b) * sum_i (set_bits_in_block_i / 256)^8
+ /// ```
+ ///
+ /// ## Implementation
+ ///
+ /// Each iteration does two passes over the block pairs:
+ /// 1. **Read-only FPP estimate**: OR each pair into a stack-local `Block`, popcount it,
+ ///    and accumulate the per-block FPP. If the estimated FPP exceeds the target, stop.
+ /// 2. **In-place fold**: OR pairs and write results to the first half of the existing
+ ///    Vec, then truncate. This reuses the existing allocation; no heap allocation is
+ ///    needed per fold level.
+ ///
+ /// The data from pass 1 remains hot in CPU cache for pass 2. Separating the OR from
+ /// the popcount (via `Block::count_ones`) allows the compiler to emit batched SIMD
+ /// popcount instructions.
+ ///
+ /// ## Correctness
+ ///
+ /// Folding **never introduces false negatives**. Every bit that was set in the
+ /// original filter remains set in the folded filter (via bitwise OR). The only effect
+ /// is a controlled increase in FPP as set bits from different blocks are merged together.
+ ///
+ /// ## References
+ ///
+ /// - Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+ /// IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+ pub fn fold_to_target_fpp(&mut self, target_fpp: f64) {
+ while self.0.len() >= 2 {
+ // Pass 1: Read-only FPP estimate. If the fold would exceed the
+ // target FPP, stop before mutating the filter.
+ if self.estimated_fpp_after_fold() > target_fpp {
Review Comment:
For benchmarking, how much is the estimated FPP an issue vs. the actual folding? Not necessarily for this PR, but it seems like it could be possible to binary search for the target number of folds, and then do all folding in a single pass (one could probably just do all folding in a single pass anyway, even without the binary search).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]