adriangb commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3022702713


##########
parquet/src/file/properties.rs:
##########
@@ -1201,28 +1213,37 @@ pub struct BloomFilterProperties {
     /// smaller the fpp, the more memory and disk space is required, thus 
setting it to a reasonable value
     /// e.g. 0.1, 0.05, or 0.001 is recommended.
     ///
-    /// Setting to a very small number diminishes the value of the filter 
itself, as the bitset size is
-    /// even larger than just storing the whole value. You are also expected 
to set `ndv` if it can
-    /// be known in advance to greatly reduce space usage.
+    /// This value also serves as the target FPP for bloom filter folding: 
after all values

Review Comment:
   I kind of went down this path and then backed out to minimize the diff / 
change. I would propose we merge this and then refactor into 
`BloomFilterPropertiesBuilder` to ease future changes.



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -431,6 +449,165 @@ impl Sbbf {
         self.0.capacity() * std::mem::size_of::<Block>()
     }
 
+    /// Returns the number of blocks in this bloom filter.
+    pub fn num_blocks(&self) -> usize {
+        self.0.len()
+    }
+
+    /// Fold the bloom filter once, halving its size by merging adjacent block 
pairs.
+    ///
+    /// This implements an elementary folding operation for Split Block Bloom
+    /// Filters. Each pair of adjacent blocks is combined via bitwise OR:
+    ///
+    /// ```text
+    /// folded[i] = blocks[2*i] | blocks[2*i + 1]    for 0 <= i < num_blocks/2
+    /// ```
+    ///
+    /// ## Why adjacent pairs (not halves)?
+    ///
+    /// SBBFs use **multiplicative** hashing for block selection:
+    ///
+    /// ```text
+    /// block_index = ((hash >> 32) * num_blocks) >> 32
+    /// ```
+    ///
+    /// When `num_blocks` is halved, the new index becomes 
`floor(original_index / 2)`.
+    /// Therefore blocks `2i` and `2i+1` map to the same position `i` in the 
folded filter.
+    ///
+    /// This differs from standard Bloom filter folding, which merges the two 
halves
+    /// (`B[i] | B[i + m/2]`) because standard filters use modular hashing 
where
+    /// `h(x) mod (m/2)` maps indices `i` and `i + m/2` to the same position.
+    ///
+    /// ## References
+    ///
+    /// 1. Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+    ///    IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+    ///
+    /// # Panics
+    ///
+    /// Panics if the filter has fewer than 2 blocks.
+    fn fold_once(&mut self) {
+        let len = self.0.len();
+        assert!(
+            len >= 2,
+            "Cannot fold a bloom filter with fewer than 2 blocks"
+        );
+        assert!(
+            len % 2 == 0,
+            "Cannot fold a bloom filter with an odd number of blocks"
+        );
+        let half = len / 2;
+        for i in 0..half {
+            for j in 0..8 {
+                self.0[i].0[j] = self.0[2 * i].0[j] | self.0[2 * i + 1].0[j];

Review Comment:
   I will see what I can do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to