alamb commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3021977094
##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -431,6 +449,165 @@ impl Sbbf {
self.0.capacity() * std::mem::size_of::<Block>()
}
+ /// Returns the number of blocks in this bloom filter.
+ pub fn num_blocks(&self) -> usize {
+ self.0.len()
+ }
+
+ /// Fold the bloom filter once, halving its size by merging adjacent block pairs.
+ ///
+ /// This implements an elementary folding operation for Split Block Bloom
+ /// Filters. Each pair of adjacent blocks is combined via bitwise OR:
+ ///
+ /// ```text
+ /// folded[i] = blocks[2*i] | blocks[2*i + 1] for 0 <= i < num_blocks/2
+ /// ```
+ ///
+ /// ## Why adjacent pairs (not halves)?
+ ///
+ /// SBBFs use **multiplicative** hashing for block selection:
+ ///
+ /// ```text
+ /// block_index = ((hash >> 32) * num_blocks) >> 32
+ /// ```
+ ///
+ /// When `num_blocks` is halved, the new index becomes `floor(original_index / 2)`.
+ /// Therefore blocks `2i` and `2i+1` map to the same position `i` in the folded filter.
+ ///
+ /// This differs from standard Bloom filter folding, which merges the two halves
+ /// (`B[i] | B[i + m/2]`) because standard filters use modular hashing where
+ /// `h(x) mod (m/2)` maps indices `i` and `i + m/2` to the same position.
+ ///
+ /// ## References
+ ///
+ /// 1. Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+ /// IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+ ///
+ /// # Panics
+ ///
+ /// Panics if the filter has fewer than 2 blocks.
+ fn fold_once(&mut self) {
+ let len = self.0.len();
+ assert!(
+ len >= 2,
+ "Cannot fold a bloom filter with fewer than 2 blocks"
+ );
+ assert!(
+ len % 2 == 0,
+ "Cannot fold a bloom filter with an odd number of blocks"
+ );
+ let half = len / 2;
+ for i in 0..half {
+ for j in 0..8 {
+ self.0[i].0[j] = self.0[2 * i].0[j] | self.0[2 * i + 1].0[j];
+ }
+ }
+ self.0.truncate(half);
+ }
+
+ /// Estimate the FPP that would result from folding once, without mutating the filter.
+ ///
+ /// Unlike standard Bloom filters where FPP depends on the global fill ratio, SBBF
+ /// membership checks are **per-block**: a query hashes to exactly one block, then checks
+ /// `k=8` bits within that block. The FPP is therefore the **average of per-block FPPs**:
+ ///
+ /// ```text
+ /// FPP = (1 / num_blocks) * sum_i (set_bits_in_block_i / 256)^8
+ /// ```
+ ///
+ /// To project the FPP after a fold, we simulate the merge of each adjacent pair `(2i, 2i+1)`
+ /// by computing `(block[2i] | block[2i+1]).count_ones()` without actually mutating the filter.
+ fn estimated_fpp_after_fold(&self) -> f64 {
+ let half = self.0.len() / 2;
+ let mut total_fpp: f64 = 0.0;
+ for i in 0..half {
+ let mut set_bits: u32 = 0;
+ for j in 0..8 {
+ set_bits += (self.0[2 * i].0[j] | self.0[2 * i + 1].0[j]).count_ones();
+ }
+ let block_fill = f64::from(set_bits) / 256.0;
+ total_fpp += block_fill.powi(8);
+ }
+ total_fpp / half as f64
+ }
+
+ /// Fold the bloom filter down to the smallest size that still meets the target FPP.
Review Comment:
It might help to redundantly define FPP on this function
```suggestion
/// Fold the bloom filter down to the smallest size that still meets the target FPP
/// (False Positive Percentage)
```
##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -431,6 +449,165 @@ impl Sbbf {
self.0.capacity() * std::mem::size_of::<Block>()
}
+ /// Returns the number of blocks in this bloom filter.
+ pub fn num_blocks(&self) -> usize {
+ self.0.len()
+ }
+
+ /// Fold the bloom filter once, halving its size by merging adjacent block pairs.
+ ///
+ /// This implements an elementary folding operation for Split Block Bloom
+ /// Filters. Each pair of adjacent blocks is combined via bitwise OR:
+ ///
+ /// ```text
+ /// folded[i] = blocks[2*i] | blocks[2*i + 1] for 0 <= i < num_blocks/2
+ /// ```
+ ///
+ /// ## Why adjacent pairs (not halves)?
+ ///
+ /// SBBFs use **multiplicative** hashing for block selection:
+ ///
+ /// ```text
+ /// block_index = ((hash >> 32) * num_blocks) >> 32
+ /// ```
+ ///
+ /// When `num_blocks` is halved, the new index becomes `floor(original_index / 2)`.
+ /// Therefore blocks `2i` and `2i+1` map to the same position `i` in the folded filter.
+ ///
+ /// This differs from standard Bloom filter folding, which merges the two halves
+ /// (`B[i] | B[i + m/2]`) because standard filters use modular hashing where
+ /// `h(x) mod (m/2)` maps indices `i` and `i + m/2` to the same position.
+ ///
+ /// ## References
+ ///
+ /// 1. Sailhan, F. & Stehr, M-O. "Folding and Unfolding Bloom Filters",
+ /// IEEE iThings 2012. <https://doi.org/10.1109/GreenCom.2012.16>
+ ///
+ /// # Panics
+ ///
+ /// Panics if the filter has fewer than 2 blocks.
+ fn fold_once(&mut self) {
+ let len = self.0.len();
+ assert!(
+ len >= 2,
+ "Cannot fold a bloom filter with fewer than 2 blocks"
+ );
+ assert!(
+ len % 2 == 0,
+ "Cannot fold a bloom filter with an odd number of blocks"
+ );
+ let half = len / 2;
+ for i in 0..half {
+ for j in 0..8 {
+ self.0[i].0[j] = self.0[2 * i].0[j] | self.0[2 * i + 1].0[j];
Review Comment:
FYI this code seems very expensive in profiling
<img width="1452" height="1268" alt="Image" src="https://github.com/user-attachments/assets/9f94a487-319a-4890-ac25-0815c82f7cbc" />
It also consumes a huge amount of time even in release builds
<img width="1456" height="247" alt="Image" src="https://github.com/user-attachments/assets/b134c0e7-3938-4de9-a17a-b62ed06265be" />
I wonder if there is some way to make this faster (unchecked access / rust iterators, etc)
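One possible iterator-based rewrite along those lines, as a sketch only: `Block` here is a hypothetical `[u32; 8]` stand-in for the crate's real type, and this trades one temporary allocation (half the original size) for fixed-size pair access that the optimizer can bounds-check-eliminate and auto-vectorize:

```rust
// Hypothetical stand-in for the crate's 256-bit SBBF block: 8 x u32 words.
type Block = [u32; 8];

/// OR-merge adjacent block pairs. `chunks_exact(2)` yields fixed-size pairs,
/// letting the compiler elide per-index bounds checks in the inner OR loop.
fn fold_once(blocks: &mut Vec<Block>) {
    assert!(
        blocks.len() >= 2 && blocks.len() % 2 == 0,
        "need an even number of blocks >= 2"
    );
    let folded: Vec<Block> = blocks
        .chunks_exact(2)
        .map(|pair| std::array::from_fn(|j| pair[0][j] | pair[1][j]))
        .collect();
    *blocks = folded;
}

fn main() {
    let mut blocks: Vec<Block> = vec![[1; 8], [2; 8], [4; 8], [8; 8]];
    fold_once(&mut blocks);
    // (1 | 2) = 3 in the first folded block, (4 | 8) = 12 in the second.
    assert_eq!(blocks, vec![[3; 8], [12; 8]]);
}
```

Whether the allocation outweighs the vectorization win would need benchmarking against the in-place version in the PR.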
##########
parquet/src/file/properties.rs:
##########
@@ -1201,28 +1213,37 @@ pub struct BloomFilterProperties {
/// smaller the fpp, the more memory and disk space is required, thus setting it to a reasonable value
/// e.g. 0.1, 0.05, or 0.001 is recommended.
///
- /// Setting to a very small number diminishes the value of the filter itself, as the bitset size is
- /// even larger than just storing the whole value. You are also expected to set `ndv` if it can
- /// be known in advance to greatly reduce space usage.
+ /// This value also serves as the target FPP for bloom filter folding: after all values
+ /// are inserted, the filter is folded down to the smallest size that still meets this FPP.
pub fpp: f64,
- /// Number of distinct values, should be non-negative to be meaningful. Defaults to [`DEFAULT_BLOOM_FILTER_NDV`].
+ /// Maximum expected number of distinct values. When `None` (default), the bloom filter
+ /// is sized based on the row group's `max_row_group_row_count` at runtime.
///
/// You should set this value by calling [`WriterPropertiesBuilder::set_bloom_filter_ndv`].
///
- /// Usage of bloom filter is most beneficial for columns with large cardinality, so a good heuristic
- /// is to set ndv to the number of rows. However, it can reduce disk size if you know in advance a smaller
- /// number of distinct values. For very small ndv value it is probably not worth it to use bloom filter
- /// anyway.
- ///
- /// Increasing this value (without increasing fpp) will result in an increase in disk or memory size.
- pub ndv: u64,
+ /// The bloom filter is initially sized for this many distinct values at the given `fpp`,
+ /// then folded down after insertion to achieve optimal size. A good heuristic is to set
+ /// this to the expected number of rows in the row group. If fewer distinct values are
+ /// actually written, the filter will be automatically compacted via folding.
+ ///
+ /// Thus the only negative side of overestimating this value is that the bloom filter
+ /// will use more memory during writing than necessary, but it will not affect the final
+ /// bloom filter size on disk.
+ ///
+ /// If you wish to reduce memory usage during writing and are able to make a reasonable estimate
+ /// of the number of distinct values in a row group, it is recommended to set this value explicitly
+ /// rather than relying on the default dynamic sizing based on `max_row_group_row_count`.
+ /// If you do set this value explicitly it is probably best to set it for each column
+ /// individually via [`WriterPropertiesBuilder::set_column_bloom_filter_ndv`] rather than globally,
+ /// since different columns may have different numbers of distinct values.
+ pub ndv: Option<u64>,
Review Comment:
since this is a public struct, I think this is a breaking API change:
https://docs.rs/parquet/58.0.0/parquet/file/properties/struct.BloomFilterProperties.html
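To make the break concrete: with `pub` fields, any downstream struct literal or direct read of `ndv` stops compiling when the field's type changes. A self-contained sketch using hypothetical stand-in types (not the crate's own):

```rust
// Stand-ins mirroring BloomFilterProperties before and after this PR.
struct PropsV1 {
    fpp: f64,
    ndv: u64, // pre-PR: plain u64
}
struct PropsV2 {
    fpp: f64,
    ndv: Option<u64>, // post-PR: Option, so None means "size dynamically"
}

fn main() {
    let old = PropsV1 { fpp: 0.05, ndv: 1_000_000 };
    // The same literal against V2 fails with `ndv: 1_000_000`
    // ("expected Option<u64>, found integer"); callers must wrap in Some(..)
    // and handle None on every read.
    let new = PropsV2 { fpp: 0.05, ndv: Some(1_000_000) };
    assert_eq!(old.ndv, 1_000_000);
    assert_eq!(new.ndv, Some(1_000_000));
    let _ = (old.fpp, new.fpp);
}
```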
##########
parquet/src/file/properties.rs:
##########
@@ -1201,28 +1213,37 @@ pub struct BloomFilterProperties {
/// smaller the fpp, the more memory and disk space is required, thus setting it to a reasonable value
/// e.g. 0.1, 0.05, or 0.001 is recommended.
///
- /// Setting to a very small number diminishes the value of the filter itself, as the bitset size is
- /// even larger than just storing the whole value. You are also expected to set `ndv` if it can
- /// be known in advance to greatly reduce space usage.
+ /// This value also serves as the target FPP for bloom filter folding: after all values
Review Comment:
This is a really cool feature -- making bloom filter creation dynamically adjust the filter size based on the data. I really like that there is a way to go back to the old explicit behavior too -- this solves a real problem for using bloom filters in Parquet.
As mentioned below this would be a breaking API change so it would have to wait for the next release.
Given this is a pub struct with pub fields, I am not sure there is any way to avoid a breaking API change 🤔
One thing that could potentially improve the API would be to add a builder
```rust
struct BloomFilterPropertiesBuilder {
    ...
}
```
And change the fields of `BloomFilterProperties` to not be public 🤔
(we could do this as a follow on PR)
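One possible shape for that follow-on builder, as a sketch only: the method names, the `0.05` default, and the private-field layout here are assumptions for illustration, not the crate's actual API:

```rust
/// Hypothetical: fields made private so future changes are non-breaking.
#[derive(Clone, Debug, PartialEq)]
pub struct BloomFilterProperties {
    fpp: f64,
    ndv: Option<u64>,
}

#[derive(Default)]
pub struct BloomFilterPropertiesBuilder {
    fpp: Option<f64>,
    ndv: Option<u64>,
}

impl BloomFilterPropertiesBuilder {
    pub fn set_fpp(mut self, fpp: f64) -> Self {
        self.fpp = Some(fpp);
        self
    }
    pub fn set_ndv(mut self, ndv: u64) -> Self {
        self.ndv = Some(ndv);
        self
    }
    pub fn build(self) -> BloomFilterProperties {
        BloomFilterProperties {
            fpp: self.fpp.unwrap_or(0.05), // assumed default, not the crate's
            ndv: self.ndv,                 // None => dynamic sizing at write time
        }
    }
}

fn main() {
    let props = BloomFilterPropertiesBuilder::default()
        .set_fpp(0.01)
        .set_ndv(1_000_000)
        .build();
    assert_eq!(props.fpp, 0.01);
    assert_eq!(props.ndv, Some(1_000_000));
}
```

With private fields, later additions (or type changes like `ndv: Option<u64>`) become semver-compatible, since callers can only go through the builder.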
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]