wgtmac commented on code in PR #50008:
URL: https://github.com/apache/arrow/pull/50008#discussion_r3323281209


##########
cpp/src/parquet/properties.h:
##########
@@ -206,6 +204,20 @@ struct PARQUET_EXPORT BloomFilterOptions {
   /// | 10,000,000 | 0.05  | 13.4     | 16384 KiB |
   /// | 10,000,000 | 0.01  | 13.4     | 16384 KiB |
   double fpp = 0.05;
+
+  /// Whether to fold the bloom filter before writing it.
+  ///
+  /// If true, the writer may fold the filter before serialization to reduce 
disk
+  /// usage while preserving the target fpp estimate. Highly skewed block 
occupancy
+  /// can make this estimate optimistic; disable folding to preserve the 
initial
+  /// filter size.
+  ///
+  /// The writer resolves unset ndv as follows:

Review Comment:
   nit: use a markdown table to better illustration



##########
cpp/src/parquet/bloom_filter.cc:
##########
@@ -361,6 +362,10 @@ uint32_t 
BlockSplitBloomFilter::NumFoldsForTargetFpp(double target_fpp) const {
   }
   DCHECK_EQ(num_blocks & (num_blocks - 1), 0);
 
+  // Estimate the fill rate after folding from the current average fill rate.
+  // Folding ORs block groups together, so each fold changes the estimated 
fill rate
+  // from f to 1 - (1 - f)^2. A membership check tests kBitsSetPerBlock bits, 
making
+  // the estimated FPP equal to folded_fill_rate^kBitsSetPerBlock.

Review Comment:
   Perhaps `pow(folded_fill_rate, kBitsSetPerBlock)` is less confusing to 
differentiate from bit operation.



##########
cpp/src/parquet/properties.h:
##########
@@ -206,6 +204,20 @@ struct PARQUET_EXPORT BloomFilterOptions {
   /// | 10,000,000 | 0.05  | 13.4     | 16384 KiB |
   /// | 10,000,000 | 0.01  | 13.4     | 16384 KiB |
   double fpp = 0.05;
+
+  /// Whether to fold the bloom filter before writing it.
+  ///
+  /// If true, the writer may fold the filter before serialization to reduce 
disk

Review Comment:
   Do we want to reference https://hal.science/hal-01126174v1 in the comment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to