emkornfield commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3024559937
##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -431,6 +485,150 @@ impl Sbbf {
self.0.capacity() * std::mem::size_of::<Block>()
}
+ /// Returns the number of blocks in this bloom filter.
+ pub fn num_blocks(&self) -> usize {
+ self.0.len()
+ }
+
+ /// Fold the bloom filter once, halving its size by merging adjacent block
pairs.
+ ///
+ /// This implements an elementary folding operation for Split Block Bloom
+ /// Filters. Each pair of adjacent blocks is combined via bitwise OR:
+ ///
+ /// ```text
+ /// folded[i] = blocks[2*i] | blocks[2*i + 1] for 0 <= i < num_blocks/2
+ /// ```
+ ///
+ /// ## Why adjacent pairs (not halves)?
+ ///
+ /// SBBFs use **multiplicative** hashing for block selection:
+ ///
+ /// ```text
+ /// block_index = ((hash >> 32) * num_blocks) >> 32
+ /// ```
+ ///
+ /// When `num_blocks` is halved, the new index becomes
`floor(original_index / 2)`.
+ /// Therefore blocks `2i` and `2i+1` map to the same position `i` in the
folded filter.
+ ///
+ /// This differs from standard Bloom filter folding, which merges the two
halves
+ /// (`B[i] | B[i + m/2]`) because standard filters use modular hashing
where
+ /// `h(x) mod (m/2)` maps indices `i` and `i + m/2` to the same position.
+ ///
+ /// ## References
+ ///
+ /// 1. Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+ /// IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+ ///
+ /// # Panics
+ ///
+ /// Panics if the filter has fewer than 2 blocks.
+ fn fold_once(&mut self) {
+ let len = self.0.len();
+ assert!(
+ len >= 2,
+ "Cannot fold a bloom filter with fewer than 2 blocks"
+ );
+ assert!(
+ len % 2 == 0,
+ "Cannot fold a bloom filter with an odd number of blocks"
+ );
+ let half = len / 2;
+ for i in 0..half {
+ // Copy both blocks to stack locals before writing. This eliminates
+ // aliasing between self.0[i] and self.0[2*i], allowing the
compiler
+ // to vectorize the OR into SIMD instructions.
+ let merged = self.0[2 * i] | self.0[2 * i + 1];
+ self.0[i] = merged;
+ }
+ self.0.truncate(half);
+ }
+
+ /// Estimate the FPP that would result from folding once, without mutating
the filter.
+ ///
+ /// SBBF membership checks are per-block (`k=8` bit checks within one
256-bit block),
+ /// so FPP is the average per-block false positive probability:
+ ///
+ /// ```text
+ /// FPP = (1 / num_blocks) * sum_i (set_bits_in_block_i / 256)^8
+ /// ```
+ ///
+ /// Separating the OR from the popcount lets the compiler emit vectorized
popcount
+ /// instructions (e.g., `cnt.16b` on ARM NEON) instead of per-word scalar
popcount.
+ fn estimated_fpp_after_fold(&self) -> f64 {
+ let half = self.0.len() as f64 / 2.0;
+ let mut total_fpp: f64 = 0.0;
+ for pair in self.0.chunks_exact(2) {
Review Comment:
nit: either have a remainder block here, or assert len() is evenly divisible
by 2?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]