wgtmac commented on code in PR #50008:
URL: https://github.com/apache/arrow/pull/50008#discussion_r3323281209
##########
cpp/src/parquet/properties.h:
##########
@@ -206,6 +204,20 @@ struct PARQUET_EXPORT BloomFilterOptions {
/// | 10,000,000 | 0.05 | 13.4 | 16384 KiB |
/// | 10,000,000 | 0.01 | 13.4 | 16384 KiB |
double fpp = 0.05;
+
+ /// Whether to fold the bloom filter before writing it.
+ ///
+ /// If true, the writer may fold the filter before serialization to reduce
disk
+ /// usage while preserving the target fpp estimate. Highly skewed block
occupancy
+ /// can make this estimate optimistic; disable folding to preserve the
initial
+ /// filter size.
+ ///
+ /// The writer resolves unset ndv as follows:
Review Comment:
nit: use a markdown table to better illustration
##########
cpp/src/parquet/bloom_filter.cc:
##########
@@ -361,6 +362,10 @@ uint32_t
BlockSplitBloomFilter::NumFoldsForTargetFpp(double target_fpp) const {
}
DCHECK_EQ(num_blocks & (num_blocks - 1), 0);
+ // Estimate the fill rate after folding from the current average fill rate.
+ // Folding ORs block groups together, so each fold changes the estimated
fill rate
+ // from f to 1 - (1 - f)^2. A membership check tests kBitsSetPerBlock bits,
making
+ // the estimated FPP equal to folded_fill_rate^kBitsSetPerBlock.
Review Comment:
Perhaps `pow(folded_fill_rate, kBitsSetPerBlock)` is less confusing to
differentiate from bit operation.
##########
cpp/src/parquet/properties.h:
##########
@@ -206,6 +204,20 @@ struct PARQUET_EXPORT BloomFilterOptions {
/// | 10,000,000 | 0.05 | 13.4 | 16384 KiB |
/// | 10,000,000 | 0.01 | 13.4 | 16384 KiB |
double fpp = 0.05;
+
+ /// Whether to fold the bloom filter before writing it.
+ ///
+ /// If true, the writer may fold the filter before serialization to reduce
disk
Review Comment:
Do we want to reference https://hal.science/hal-01126174v1 in the comment?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]