adriangb commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3035727114


##########
parquet/src/file/properties.rs:
##########
@@ -1201,20 +1231,30 @@ pub struct BloomFilterProperties {
     /// smaller the fpp, the more memory and disk space is required, thus setting it to a reasonable value
     /// e.g. 0.1, 0.05, or 0.001 is recommended.
     ///
-    /// Setting to a very small number diminishes the value of the filter itself, as the bitset size is
-    /// even larger than just storing the whole value. You are also expected to set `ndv` if it can
-    /// be known in advance to greatly reduce space usage.
+    /// This value also serves as the target FPP for bloom filter folding: after all values
+    /// are inserted, the filter is folded down to the smallest size that still meets this FPP.
     pub fpp: f64,
-    /// Number of distinct values, should be non-negative to be meaningful. Defaults to [`DEFAULT_BLOOM_FILTER_NDV`].
+    /// Maximum expected number of distinct values. Defaults to [`DEFAULT_BLOOM_FILTER_NDV`].
     ///
     /// You should set this value by calling [`WriterPropertiesBuilder::set_bloom_filter_ndv`].
     ///
-    /// Usage of bloom filter is most beneficial for columns with large cardinality, so a good heuristic
-    /// is to set ndv to the number of rows. However, it can reduce disk size if you know in advance a smaller
-    /// number of distinct values. For very small ndv value it is probably not worth it to use bloom filter
-    /// anyway.
-    ///
-    /// Increasing this value (without increasing fpp) will result in an increase in disk or memory size.
+    /// When not explicitly set via the builder, this defaults to
+    /// [`max_row_group_row_count`](WriterProperties::max_row_group_row_count) (resolved at
+    /// build time). The bloom filter is initially sized for this many distinct values at the
+    /// given `fpp`, then folded down after insertion to achieve the optimal size. A good heuristic
+    /// is to set this to the expected number of rows in the row group. If fewer distinct values
+    /// are actually written, the filter will be automatically compacted via folding.
+    ///
+    /// Thus the only downside of overestimating this value is that the bloom filter
+    /// will use more memory during writing than necessary; it will not affect the final
+    /// bloom filter size on disk.
+    ///
+    /// If you wish to reduce memory usage during writing and can make a reasonable estimate
+    /// of the number of distinct values in a row group, it is recommended to set this value explicitly
+    /// rather than relying on the default dynamic sizing based on `max_row_group_row_count`.
+    /// If you do set this value explicitly, it is probably best to set it for each column
+    /// individually via [`WriterPropertiesBuilder::set_column_bloom_filter_ndv`] rather than globally,
+    /// since different columns may have different numbers of distinct values.
     pub ndv: u64,
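The memory-cost claim in the proposed docs can be sketched with the textbook split-block bloom filter sizing formula (an illustrative approximation with a hypothetical `optimal_num_bits` helper, not necessarily the crate's exact internals): the in-memory filter grows linearly with the `ndv` estimate, so overestimating `ndv` by 100x costs roughly 100x more write-time memory.

```rust
/// Approximate number of bits for a split-block bloom filter with 8 hash
/// functions, given an expected number of distinct values (ndv) and a
/// target false-positive probability (fpp). Illustrative sketch only.
fn optimal_num_bits(ndv: u64, fpp: f64) -> u64 {
    // bits = -8 * ndv / ln(1 - fpp^(1/8))
    let bits = -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
    bits.ceil() as u64
}

fn main() {
    // Memory grows linearly with the ndv estimate; folding later recovers
    // the on-disk size, but not the write-time allocation.
    println!("ndv=10_000    -> {} bits", optimal_num_bits(10_000, 0.01));
    println!("ndv=1_000_000 -> {} bits", optimal_num_bits(1_000_000, 0.01));
}
```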

Review Comment:
   So the situation now is that an explicit ndv setting is useful for reducing write memory usage when you know ndv << row group size, but we still always do the folding. So the ndv only controls the initially allocated size of the bloom filters.
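The folding behaviour described here can be illustrated with a small standalone sketch (hypothetical helpers using the classic bloom filter fpp approximation, not the crate's actual folding code): the filter is allocated for the worst-case ndv up front, then repeatedly halved as long as the smaller size still meets the target fpp for the number of values actually written, so the final on-disk size does not depend on the initial overestimate.

```rust
/// Classic bloom filter false-positive estimate for a filter of `num_bits`
/// bits holding `ndv` distinct values, with 8 hash functions.
/// Illustrative approximation only.
fn estimated_fpp(num_bits: f64, ndv: f64) -> f64 {
    let k = 8.0;
    (1.0 - (-k * ndv / num_bits).exp()).powi(8)
}

/// Fold (halve) the bitset while the smaller filter still meets the target
/// fpp for the values actually inserted. Hypothetical sketch of the idea.
fn fold(mut num_bits: u64, actual_ndv: u64, target_fpp: f64) -> u64 {
    while num_bits > 64
        && estimated_fpp((num_bits / 2) as f64, actual_ndv as f64) <= target_fpp
    {
        num_bits /= 2;
    }
    num_bits
}

fn main() {
    // Sized up front for ~1M distinct values at fpp = 0.01, but only
    // 10k distinct values were actually written:
    let initial = 9_680_000_u64;
    let folded = fold(initial, 10_000, 0.01);
    println!("initial: {initial} bits, after folding: {folded} bits");
}
```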


