wgtmac commented on code in PR #1043:
URL: https://github.com/apache/parquet-mr/pull/1043#discussion_r1142882212


##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -85,6 +85,22 @@ public class BlockSplitBloomFilter implements BloomFilter {
   private static final int[] SALT = {0x47b6137b, 0x44974d91, 0x8824ad5b, 
0xa2b7289d,
     0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31};
 
+  /**
+   * Create a Bloom filter within the maximum bytes size. The bloom filter 
bytes size
+   * will be a largest power of 2 less than the maximum bytes size.
+   *
+   * @param maxBytes The max number of bytes for Bloom filter.
+   * @return
+   */
+  public static BlockSplitBloomFilter of(int maxBytes) {

Review Comment:
   Why adding a new function but not modify the constructor directly? It still 
has the risk if user uses the old constructor.



##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java:
##########
@@ -218,6 +234,19 @@ private void initBitset(int numBytes) {
     this.intBuffer = 
ByteBuffer.wrap(bitset).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer();
   }
 
+  /**
+   * Calculate the largest power of 2 in the range [lowerBound,upperBound].
+   */
+  public static int largestTwoPowerInRange(int lowerBound, int upperBound) {

Review Comment:
   ```suggestion
     private static int largestPowerOfTwoInRange(int lowerBound, int 
upperBound) {
   ```



##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -97,7 +97,7 @@ abstract class ColumnWriterBase implements ColumnWriter {
       int optimalNumOfBits = 
BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
       this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, 
maxBloomFilterSize);
     } else {
-      this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize);
+      this.bloomFilter = BlockSplitBloomFilter.of(maxBloomFilterSize);

Review Comment:
   Does this change solve the problem?



##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -97,7 +97,7 @@ abstract class ColumnWriterBase implements ColumnWriter {
       int optimalNumOfBits = 
BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
       this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, 
maxBloomFilterSize);
     } else {
-      this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize);
+      this.bloomFilter = BlockSplitBloomFilter.of(maxBloomFilterSize);

Review Comment:
   I found that there is a constructor doing the similar thing:
   ```java
     /**
      * Constructor of block-based Bloom filter.
      *
      * @param numBytes The number of bytes for Bloom filter bitset. The range 
of num_bytes should be within
      *                 [DEFAULT_MINIMUM_BYTES, maximumBytes], it will be 
rounded up/down
      *                 to lower/upper bound if num_bytes is out of range. It 
will also be rounded up to a power
      *                 of 2. It uses XXH64 as its default hash function.
      * @param maximumBytes The maximum bytes of the Bloom filter.
      */
     public BlockSplitBloomFilter(int numBytes, int maximumBytes) {
       this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64);
     }
   ```



##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -97,7 +97,7 @@ abstract class ColumnWriterBase implements ColumnWriter {
       int optimalNumOfBits = 
BlockSplitBloomFilter.optimalNumOfBits(ndv.getAsLong(), fpp.getAsDouble());
       this.bloomFilter = new BlockSplitBloomFilter(optimalNumOfBits / 8, 
maxBloomFilterSize);
     } else {
-      this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize);
+      this.bloomFilter = BlockSplitBloomFilter.of(maxBloomFilterSize);

Review Comment:
   ```suggestion
         this.bloomFilter = new BlockSplitBloomFilter(maxBloomFilterSize, 
maxBloomFilterSize);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to