viirya commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3035710496
##########
parquet/src/file/properties.rs:
##########
@@ -1201,20 +1231,30 @@ pub struct BloomFilterProperties {
/// smaller the fpp, the more memory and disk space is required, thus
setting it to a reasonable value
/// e.g. 0.1, 0.05, or 0.001 is recommended.
///
- /// Setting to a very small number diminishes the value of the filter
itself, as the bitset size is
- /// even larger than just storing the whole value. You are also expected
to set `ndv` if it can
- /// be known in advance to greatly reduce space usage.
+ /// This value also serves as the target FPP for bloom filter folding:
after all values
+ /// are inserted, the filter is folded down to the smallest size that
still meets this FPP.
pub fpp: f64,
- /// Number of distinct values, should be non-negative to be meaningful.
Defaults to [`DEFAULT_BLOOM_FILTER_NDV`].
+ /// Maximum expected number of distinct values. Defaults to
[`DEFAULT_BLOOM_FILTER_NDV`].
///
/// You should set this value by calling
[`WriterPropertiesBuilder::set_bloom_filter_ndv`].
///
- /// Usage of bloom filter is most beneficial for columns with large
cardinality, so a good heuristic
- /// is to set ndv to the number of rows. However, it can reduce disk size
if you know in advance a smaller
- /// number of distinct values. For very small ndv value it is probably not
worth it to use bloom filter
- /// anyway.
- ///
- /// Increasing this value (without increasing fpp) will result in an
increase in disk or memory size.
+ /// When not explicitly set via the builder, this defaults to
+ /// [`max_row_group_row_count`](WriterProperties::max_row_group_row_count)
(resolved at
+ /// build time). The bloom filter is initially sized for this many
distinct values at the
+ /// given `fpp`, then folded down after insertion to achieve optimal size.
A good heuristic
+ /// is to set this to the expected number of rows in the row group. If
fewer distinct values
+ /// are actually written, the filter will be automatically compacted via
folding.
+ ///
+ /// Thus the only negative side of overestimating this value is that the
bloom filter
+ /// will use more memory during writing than necessary, but it will not
affect the final
+ /// bloom filter size on disk.
+ ///
+ /// If you wish to reduce memory usage during writing and are able to make
a reasonable estimate
+ /// of the number of distinct values in a row group, it is recommended to
set this value explicitly
+ /// rather than relying on the default dynamic sizing based on
`max_row_group_row_count`.
+ /// If you do set this value explicitly it is probably best to set it for
each column
+ /// individually via
[`WriterPropertiesBuilder::set_column_bloom_filter_ndv`] rather than globally,
+ /// since different columns may have different numbers of distinct values.
pub ndv: u64,
Review Comment:
If we plan to remove it, we may need add deprecated?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]