emkornfield commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3012735678


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -431,6 +451,162 @@ impl Sbbf {
         self.0.capacity() * std::mem::size_of::<Block>()
     }
 
+    /// Returns the number of blocks in this bloom filter.
+    pub fn num_blocks(&self) -> usize {
+        self.0.len()
+    }
+
+    /// Fold the bloom filter once, halving its size by merging adjacent block 
pairs.
+    ///
+    /// This implements an elementary folding operation for Split Block Bloom
+    /// Filters. Each pair of adjacent blocks is combined via bitwise OR:
+    ///
+    /// ```text
+    /// folded[i] = blocks[2*i] | blocks[2*i + 1]    for 0 <= i < num_blocks/2
+    /// ```
+    ///
+    /// ## Why adjacent pairs (not halves)?
+    ///
+    /// Standard Bloom filter folding merges the two halves (`B[i] | B[i + 
m/2]`) because
+    /// standard filters use modular hashing: `index = h(x) mod m`, so `h(x) 
mod (m/2)`
+    /// maps index `i` and index `i + m/2` to the same position.
+    ///
+    /// SBBFs use **multiplicative** hashing for block selection:
+    ///
+    /// ```text
+    /// block_index = ((hash >> 32) * num_blocks) >> 32
+    /// ```
+    ///
+    /// When `num_blocks` is halved, the new index becomes 
`floor(original_index / 2)`.
+    /// Therefore blocks `2i` and `2i+1` (not `i` and `i + N/2`) map to the 
same position `i`
+    /// in the folded filter.
+    ///
+    /// ## References
+    ///
+    /// 1. Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+    ///    IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+    ///
+    /// # Panics
+    ///
+    /// Panics if the filter has fewer than 2 blocks.
+    fn fold_once(&mut self) {
+        let len = self.0.len();
+        assert!(
+            len >= 2,
+            "Cannot fold a bloom filter with fewer than 2 blocks"
+        );
+        let half = len / 2;
+        for i in 0..half {
+            for j in 0..8 {
+                self.0[i].0[j] = self.0[2 * i].0[j] | self.0[2 * i + 1].0[j];
+            }
+        }
+        self.0.truncate(half);
+    }
+
+    /// Estimate the FPP that would result from folding once, without mutating 
the filter.
+    ///
+    /// Unlike standard Bloom filters where FPP depends on the global fill 
ratio, SBBF
+    /// membership checks are **per-block**: a query hashes to exactly one 
block, then checks
+    /// `k=8` bits within that block. The FPP is therefore the **average of 
per-block FPPs**:
+    ///
+    /// ```text
+    /// FPP = (1 / num_blocks) * sum_i (set_bits_in_block_i / 256)^8
+    /// ```
+    ///
+    /// To project the FPP after a fold, we simulate the merge of each 
adjacent pair `(2i, 2i+1)`
+    /// by computing `(block[2i] | block[2i+1]).count_ones()` without actually 
mutating the filter.
+    fn estimated_fpp_after_fold(&self) -> f64 {
+        let half = self.0.len() / 2;
+        let mut total_fpp = 0.0;
+        for i in 0..half {
+            let mut set_bits: u32 = 0;
+            for j in 0..8 {
+                set_bits += (self.0[2 * i].0[j] | self.0[2 * i + 
1].0[j]).count_ones();
+            }
+            let block_fill = set_bits as f64 / 256.0;
+            total_fpp += block_fill.powi(8);
+        }
+        total_fpp / half as f64

Review Comment:
   why is cast needed here, can this be avoided by setting the type explicitly 
on total_fpp?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to