[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

GitBox Sun, 15 Jan 2023 05:22:48 -0800


yabola commented on code in PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591022



##########
parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java:
##########
@@ -181,6 +182,60 @@ public void testBloomFilterNDVs(){
     assertTrue(bytes < 5 * 1024 * 1024);
   }
 
+  @Test
+  public void testMergeBloomFilter() throws IOException {
+    Random random = new Random();
+    int numBytes = BlockSplitBloomFilter.optimalNumOfBits(1024 * 1024, 0.01) / 8;
+    BloomFilter otherBloomFilter = new BlockSplitBloomFilter(numBytes);
+    BloomFilter mergedBloomFilter = new BlockSplitBloomFilter(numBytes);
+
+    Set<String> originStrings = new HashSet<>();
+    Set<String> testStrings = new HashSet<>();
+    Set<Integer> testInts = new HashSet<>();
+    Set<Double> testDoubles = new HashSet<>();
+    Set<Float> testFloats = new HashSet<>();
+    for (int i = 0; i < 1024; i++) {
+
+      String originStrValue = RandomStringUtils.randomAlphabetic(1, 64);
+      originStrings.add(originStrValue);
+      mergedBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(originStrValue)));
+
+      String testString = RandomStringUtils.randomAlphabetic(1, 64);
+      testStrings.add(testString);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(Binary.fromString(testString)));
+
+      int testInt = random.nextInt();
+      testInts.add(testInt);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testInt));
+
+      double testDouble = random.nextDouble();
+      testDoubles.add(testDouble);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testDouble));
+
+      float testFloat = random.nextFloat();
+      testFloats.add(testFloat);
+      otherBloomFilter.insertHash(otherBloomFilter.hash(testFloat));
+    }
+    mergedBloomFilter.merge(otherBloomFilter);
+    for (String testString : originStrings) {
+      assertTrue(mergedBloomFilter.findHash(mergedBloomFilter.hash(Binary.fromString(testString))));

Review Comment:
   If the BloomFilter to be merged is not empty, there is a small probability that the two BloomFilters will disagree on whether a given hash value is present (I added random values).
   
   But if the BloomFilter to be merged is empty at the beginning, the two BloomFilters should always return the same result.
   
   I added two different test cases; I am not sure whether I need to add more.
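
   The behaviour described above can be illustrated with a minimal, self-contained sketch. It reads "the BloomFilter to be merged" as the filter that receives the merge, which is an interpretation of the comment. It does not use parquet-mr's BlockSplitBloomFilter: `ToyBloomFilter`, its hash function, its three probes per value, and `sameBits` are all hypothetical names and choices made for illustration, and the merge is assumed to be a bitwise OR of two equally sized bitsets.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration only. It is NOT parquet-mr's
// BlockSplitBloomFilter: the hash function, the bit layout, and the number
// of probes are all assumptions made for this sketch.
final class ToyBloomFilter {
  private static final int PROBES = 3; // assumed number of bits set per value

  private final int numBits;
  private final BitSet bits;

  ToyBloomFilter(int numBits) {
    if (numBits <= 0) {
      throw new IllegalArgumentException("numBits must be positive");
    }
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  // Hypothetical 64-bit hash: String.hashCode() passed through a
  // splitmix64-style finalizer.
  static long hash(String value) {
    long h = value.hashCode();
    h = (h ^ (h >>> 30)) * 0xBF58476D1CE4E5B9L;
    h = (h ^ (h >>> 27)) * 0x94D049BB133111EBL;
    return h ^ (h >>> 31);
  }

  // Derives the i-th bit position for a hash.
  private int index(long hash, int i) {
    long h = hash + i * 0x9E3779B97F4A7C15L;
    h ^= h >>> 29;
    return (int) Math.floorMod(h, (long) numBits);
  }

  void insertHash(long hash) {
    for (int i = 0; i < PROBES; i++) {
      bits.set(index(hash, i));
    }
  }

  boolean findHash(long hash) {
    for (int i = 0; i < PROBES; i++) {
      if (!bits.get(index(hash, i))) {
        return false;
      }
    }
    return true;
  }

  // Assumed merge semantics: bitwise OR of two filters of the same size.
  void merge(ToyBloomFilter other) {
    if (other.numBits != numBits) {
      throw new IllegalArgumentException("filters must have the same size");
    }
    bits.or(other.bits);
  }

  boolean sameBits(ToyBloomFilter other) {
    return numBits == other.numBits && bits.equals(other.bits);
  }

  public static void main(String[] args) {
    ToyBloomFilter other = new ToyBloomFilter(4096);
    ToyBloomFilter emptyTarget = new ToyBloomFilter(4096);
    ToyBloomFilter filledTarget = new ToyBloomFilter(4096);
    for (int i = 0; i < 200; i++) {
      other.insertHash(hash("test-" + i));
      filledTarget.insertHash(hash("origin-" + i));
    }
    emptyTarget.merge(other);
    filledTarget.merge(other);

    // Count probes on which the pre-filled target and the source disagree.
    int disagreements = 0;
    for (int i = 0; i < 10_000; i++) {
      long probe = hash("probe-" + i);
      if (filledTarget.findHash(probe) != other.findHash(probe)) {
        disagreements++;
      }
    }
    System.out.println("empty target has same bits as source: "
        + emptyTarget.sameBits(other));
    System.out.println("probes where filled target and source disagree: "
        + disagreements);
  }
}
```

   Under these assumptions, a target that starts empty ends up bit-for-bit identical to the source, so both answer every lookup the same way. A target that already holds values keeps its own bits as well, so it can report a hash as present when the source does not, but never the reverse.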




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

