alamb commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3021977094


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -431,6 +449,165 @@ impl Sbbf {
         self.0.capacity() * std::mem::size_of::<Block>()
     }
 
+    /// Returns the number of blocks in this bloom filter.
+    pub fn num_blocks(&self) -> usize {
+        self.0.len()
+    }
+
+    /// Fold the bloom filter once, halving its size by merging adjacent block pairs.
+    ///
+    /// This implements an elementary folding operation for Split Block Bloom
+    /// Filters. Each pair of adjacent blocks is combined via bitwise OR:
+    ///
+    /// ```text
+    /// folded[i] = blocks[2*i] | blocks[2*i + 1]    for 0 <= i < num_blocks/2
+    /// ```
+    ///
+    /// ## Why adjacent pairs (not halves)?
+    ///
+    /// SBBFs use **multiplicative** hashing for block selection:
+    ///
+    /// ```text
+    /// block_index = ((hash >> 32) * num_blocks) >> 32
+    /// ```
+    ///
+    /// When `num_blocks` is halved, the new index becomes `floor(original_index / 2)`.
+    /// Therefore blocks `2i` and `2i+1` map to the same position `i` in the folded filter.
+    ///
+    /// This differs from standard Bloom filter folding, which merges the two halves
+    /// (`B[i] | B[i + m/2]`) because standard filters use modular hashing where
+    /// `h(x) mod (m/2)` maps indices `i` and `i + m/2` to the same position.
+    ///
+    /// ## References
+    ///
+    /// 1. Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+    ///    IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+    ///
+    /// # Panics
+    ///
+    /// Panics if the filter has fewer than 2 blocks.
+    fn fold_once(&mut self) {
+        let len = self.0.len();
+        assert!(
+            len >= 2,
+            "Cannot fold a bloom filter with fewer than 2 blocks"
+        );
+        assert!(
+            len % 2 == 0,
+            "Cannot fold a bloom filter with an odd number of blocks"
+        );
+        let half = len / 2;
+        for i in 0..half {
+            for j in 0..8 {
+                self.0[i].0[j] = self.0[2 * i].0[j] | self.0[2 * i + 1].0[j];
+            }
+        }
+        self.0.truncate(half);
+    }
+
+    /// Estimate the FPP that would result from folding once, without mutating the filter.
+    ///
+    /// Unlike standard Bloom filters where FPP depends on the global fill ratio, SBBF
+    /// membership checks are **per-block**: a query hashes to exactly one block, then checks
+    /// `k=8` bits within that block. The FPP is therefore the **average of per-block FPPs**:
+    ///
+    /// ```text
+    /// FPP = (1 / num_blocks) * sum_i (set_bits_in_block_i / 256)^8
+    /// ```
+    ///
+    /// To project the FPP after a fold, we simulate the merge of each adjacent pair `(2i, 2i+1)`
+    /// by computing `(block[2i] | block[2i+1]).count_ones()` without actually mutating the filter.
+    fn estimated_fpp_after_fold(&self) -> f64 {
+        let half = self.0.len() / 2;
+        let mut total_fpp: f64 = 0.0;
+        for i in 0..half {
+            let mut set_bits: u32 = 0;
+            for j in 0..8 {
+                set_bits += (self.0[2 * i].0[j] | self.0[2 * i + 1].0[j]).count_ones();
+            }
+            let block_fill = f64::from(set_bits) / 256.0;
+            total_fpp += block_fill.powi(8);
+        }
+        total_fpp / half as f64
+    }
+
+    /// Fold the bloom filter down to the smallest size that still meets the target FPP.

Review Comment:
   It might help to redundantly define FPP on this function
   
   ```suggestion
    /// Fold the bloom filter down to the smallest size that still meets the target FPP
    /// (False Positive Percentage)
   ```
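   For readers who want to sanity-check the per-block FPP formula quoted in the diff above, here is a standalone sketch. The `[[u32; 8]]` layout is a stand-in for the crate's 256-bit `Block` type, and `estimated_fpp` is an illustrative name, not the actual API:

   ```rust
   /// Average of per-block FPPs: (1/n) * sum_i (set_bits_i / 256)^8,
   /// matching the formula in the doc comment above.
   fn estimated_fpp(blocks: &[[u32; 8]]) -> f64 {
       let total: f64 = blocks
           .iter()
           .map(|b| {
               // Count set bits in the 256-bit block (8 x u32 words).
               let set_bits: u32 = b.iter().map(|w| w.count_ones()).sum();
               // Per-block FPP with k = 8 probe bits.
               (f64::from(set_bits) / 256.0).powi(8)
           })
           .sum();
       total / blocks.len() as f64
   }
   ```

   A fully saturated block contributes 1.0, an empty block contributes 0.0, so a filter with one of each averages to 0.5.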



##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -431,6 +449,165 @@ impl Sbbf {
         self.0.capacity() * std::mem::size_of::<Block>()
     }
 
+    /// Returns the number of blocks in this bloom filter.
+    pub fn num_blocks(&self) -> usize {
+        self.0.len()
+    }
+
+    /// Fold the bloom filter once, halving its size by merging adjacent block pairs.
+    ///
+    /// This implements an elementary folding operation for Split Block Bloom
+    /// Filters. Each pair of adjacent blocks is combined via bitwise OR:
+    ///
+    /// ```text
+    /// folded[i] = blocks[2*i] | blocks[2*i + 1]    for 0 <= i < num_blocks/2
+    /// ```
+    ///
+    /// ## Why adjacent pairs (not halves)?
+    ///
+    /// SBBFs use **multiplicative** hashing for block selection:
+    ///
+    /// ```text
+    /// block_index = ((hash >> 32) * num_blocks) >> 32
+    /// ```
+    ///
+    /// When `num_blocks` is halved, the new index becomes `floor(original_index / 2)`.
+    /// Therefore blocks `2i` and `2i+1` map to the same position `i` in the folded filter.
+    ///
+    /// This differs from standard Bloom filter folding, which merges the two halves
+    /// (`B[i] | B[i + m/2]`) because standard filters use modular hashing where
+    /// `h(x) mod (m/2)` maps indices `i` and `i + m/2` to the same position.
+    ///
+    /// ## References
+    ///
+    /// 1. Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+    ///    IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+    ///
+    /// # Panics
+    ///
+    /// Panics if the filter has fewer than 2 blocks.
+    fn fold_once(&mut self) {
+        let len = self.0.len();
+        assert!(
+            len >= 2,
+            "Cannot fold a bloom filter with fewer than 2 blocks"
+        );
+        assert!(
+            len % 2 == 0,
+            "Cannot fold a bloom filter with an odd number of blocks"
+        );
+        let half = len / 2;
+        for i in 0..half {
+            for j in 0..8 {
+                self.0[i].0[j] = self.0[2 * i].0[j] | self.0[2 * i + 1].0[j];

Review Comment:
   FYI this code seems very expensive in profiling
   
   <img width="1452" height="1268" alt="Image" src="https://github.com/user-attachments/assets/9f94a487-319a-4890-ac25-0815c82f7cbc" />
   
   It also consumes a huge amount of time even in release builds
   <img width="1456" height="247" alt="Image" src="https://github.com/user-attachments/assets/b134c0e7-3938-4de9-a17a-b62ed06265be" />
   
   I wonder if there is some way to make this faster (unchecked access / Rust iterators, etc.)
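   One possible iterator-based direction, sketched standalone with `Block` modeled as `[u32; 8]` (not benchmarked, and `fold_once_iter` is an illustrative name): `chunks_exact(2)` guarantees each pair has length 2, which helps the compiler elide bounds checks inside the loop, at the cost of collecting into a new allocation rather than folding in place:

   ```rust
   /// Merge adjacent block pairs with bitwise OR, as in `fold_once`,
   /// but via iterators instead of indexed access.
   fn fold_once_iter(blocks: &[[u32; 8]]) -> Vec<[u32; 8]> {
       blocks
           .chunks_exact(2)
           .map(|pair| {
               // pair is guaranteed to have exactly 2 blocks.
               std::array::from_fn(|j| pair[0][j] | pair[1][j])
           })
           .collect()
   }
   ```

   An in-place variant using `split_at_mut` (or copying each pair before writing, since `Block` is `Copy`) would avoid the allocation; either way the actual speedup would need to be confirmed against the profiles above.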



##########
parquet/src/file/properties.rs:
##########
@@ -1201,28 +1213,37 @@ pub struct BloomFilterProperties {
     /// smaller the fpp, the more memory and disk space is required, thus setting it to a reasonable value
     /// e.g. 0.1, 0.05, or 0.001 is recommended.
     ///
-    /// Setting to a very small number diminishes the value of the filter itself, as the bitset size is
-    /// even larger than just storing the whole value. You are also expected to set `ndv` if it can
-    /// be known in advance to greatly reduce space usage.
+    /// This value also serves as the target FPP for bloom filter folding: after all values
+    /// are inserted, the filter is folded down to the smallest size that still meets this FPP.
     pub fpp: f64,
-    /// Number of distinct values, should be non-negative to be meaningful. Defaults to [`DEFAULT_BLOOM_FILTER_NDV`].
+    /// Maximum expected number of distinct values. When `None` (default), the bloom filter
+    /// is sized based on the row group's `max_row_group_row_count` at runtime.
     ///
     /// You should set this value by calling [`WriterPropertiesBuilder::set_bloom_filter_ndv`].
     ///
-    /// Usage of bloom filter is most beneficial for columns with large cardinality, so a good heuristic
-    /// is to set ndv to the number of rows. However, it can reduce disk size if you know in advance a smaller
-    /// number of distinct values. For very small ndv value it is probably not worth it to use bloom filter
-    /// anyway.
-    ///
-    /// Increasing this value (without increasing fpp) will result in an increase in disk or memory size.
-    pub ndv: u64,
+    /// The bloom filter is initially sized for this many distinct values at the given `fpp`,
+    /// then folded down after insertion to achieve optimal size. A good heuristic is to set
+    /// this to the expected number of rows in the row group. If fewer distinct values are
+    /// actually written, the filter will be automatically compacted via folding.
+    ///
+    /// Thus the only downside of overestimating this value is that the bloom filter
+    /// will use more memory during writing than necessary; it will not affect the final
+    /// bloom filter size on disk.
+    ///
+    /// If you wish to reduce memory usage during writing and can make a reasonable estimate
+    /// of the number of distinct values in a row group, it is recommended to set this value
+    /// explicitly rather than relying on the default dynamic sizing based on
+    /// `max_row_group_row_count`. If you do set this value explicitly, it is probably best
+    /// to set it for each column individually via
+    /// [`WriterPropertiesBuilder::set_column_bloom_filter_ndv`] rather than globally,
+    /// since different columns may have different numbers of distinct values.
+    pub ndv: Option<u64>,

Review Comment:
   since this is a public struct, I think this is a breaking API change: https://docs.rs/parquet/58.0.0/parquet/file/properties/struct.BloomFilterProperties.html
   
   



##########
parquet/src/file/properties.rs:
##########
@@ -1201,28 +1213,37 @@ pub struct BloomFilterProperties {
     /// smaller the fpp, the more memory and disk space is required, thus setting it to a reasonable value
     /// e.g. 0.1, 0.05, or 0.001 is recommended.
     ///
-    /// Setting to a very small number diminishes the value of the filter itself, as the bitset size is
-    /// even larger than just storing the whole value. You are also expected to set `ndv` if it can
-    /// be known in advance to greatly reduce space usage.
+    /// This value also serves as the target FPP for bloom filter folding: after all values

Review Comment:
   This is a really cool feature -- making bloom filter creation dynamically adjust the bloom filter size based on the data. I really like that there is a way to go back to the old explicit behavior too -- it solves a real problem for using bloom filters in Parquet.
   
   As mentioned below this would be a breaking API change so would have to wait for the next release.
   
   Given this is a pub struct with pub fields, I am not sure there is any way to avoid a breaking API change 🤔 
   
   One thing that could potentially improve the API would be to make a builder:
   ```rust
   struct BloomFilterPropertiesBuilder { 
   ...
   }
   ```
   
   And change the fields of `BloomFilterProperties` to not be public 🤔 
   
   (we could do this as a follow on PR)
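   A minimal sketch of what such a builder might look like. All method names, defaults, and the getter set here are hypothetical, not the crate's actual API:

   ```rust
   // Hypothetical builder sketch; a real version would live alongside
   // WriterPropertiesBuilder and reuse the crate's defaults.
   #[derive(Debug, Clone, PartialEq)]
   pub struct BloomFilterProperties {
       fpp: f64,         // private: read via fpp()
       ndv: Option<u64>, // private: read via ndv()
   }

   impl BloomFilterProperties {
       pub fn fpp(&self) -> f64 {
           self.fpp
       }
       pub fn ndv(&self) -> Option<u64> {
           self.ndv
       }
   }

   pub struct BloomFilterPropertiesBuilder {
       fpp: f64,
       ndv: Option<u64>,
   }

   impl BloomFilterPropertiesBuilder {
       pub fn new() -> Self {
           // 0.05 is within the range the docs recommend; None means
           // "size dynamically from the row group row count".
           Self { fpp: 0.05, ndv: None }
       }
       pub fn with_fpp(mut self, fpp: f64) -> Self {
           self.fpp = fpp;
           self
       }
       pub fn with_ndv(mut self, ndv: u64) -> Self {
           self.ndv = Some(ndv);
           self
       }
       pub fn build(self) -> BloomFilterProperties {
           BloomFilterProperties { fpp: self.fpp, ndv: self.ndv }
       }
   }
   ```

   With private fields and getters, `ndv: u64` → `ndv: Option<u64>` style changes could then happen without breaking downstream struct literals or field accesses.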
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to