wgtmac commented on code in PR #1043:
URL: https://github.com/apache/parquet-mr/pull/1043#discussion_r1155587214


##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -97,7 +97,7 @@ abstract class ColumnWriterBase implements ColumnWriter {
       int optimalNumOfBits = 
BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
       this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, 
maxBloomFilterSize);
     } else {
-      this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize);
+      this.bloomFilter = BlockSplitBloomFilter.of(maxBloomFilterSize);

Review Comment:
   The goal is to fix the issue that bloom filter size can be larger than 
`parquet.bloom.filter.max.bytes`. The problematic constructors used are as 
below:
   
   ```java
       int maxBloomFilterSize = props.getMaxBloomFilterBytes();
   
       OptionalLong ndv = props.getBloomFilterNDV(path);
       OptionalDouble fpp = props.getBloomFilterFPP(path);
       // If user specify the column NDV, we construct Bloom filter from it.
       if (ndv.isPresent()) {
         int optimalNumOfBits = 
BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
         this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, 
maxBloomFilterSize);
       } else {
         this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize);
       }
   ```
   
   They are all delegated by `public BlockSplitBloomFilter(int numBytes, int 
minimumBytes, int maximumBytes, HashStrategy hashStrategy)`. 
   ```java
     public BlockSplitBloomFilter(int numBytes) {
       this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64);
     }
   
     public BlockSplitBloomFilter(int numBytes, int maximumBytes) {
       this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64);
     }
   ```
   
   So it seems to me that the issue comes from `public 
BlockSplitBloomFilter(int numBytes)` where `maxBloomFilterSize` is passed to 
`numBytes` and the actual size can be adjusted to values as large as 
`UPPER_BOUND_BYTES`.
   
   Is it possible to fix it like below?
   ```java
     public BlockSplitBloomFilter(int numBytes) {
       this(numBytes, LOWER_BOUND_BYTES, numBytes, HashStrategy.XXH64);
     }
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to