[GitHub] [parquet-mr] huaxingao commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

GitBox Wed, 15 Jun 2022 10:16:38 -0700


huaxingao commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r898229962



##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##########
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws IOException {
     }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+    int totalCount = 100000;
+    double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+    Set<String> distinctStrings = new HashSet<>();
+    while (distinctStrings.size() < totalCount) {
+      String str = RandomStringUtils.randomAlphabetic(12);
+      distinctStrings.add(str);
+    }
+
+    MessageType schema = Types.buildMessage().
+      required(BINARY).as(stringType()).named("name").named("msg");
+
+    Configuration conf = new Configuration();
+    GroupWriteSupport.setSchema(schema, conf);
+
+    GroupFactory factory = new SimpleGroupFactory(schema);
+    for (int i = 0; i < testFpp.length; i++) {
+      File file = temp.newFile();
+      file.delete();
+      Path path = new Path(file.getAbsolutePath());
+      try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+        .withPageRowCountLimit(10)
+        .withConf(conf)
+        .withDictionaryEncoding(false)
+        .withBloomFilterEnabled("name", true)
+        .withBloomFilterNDV("name", totalCount)
+        .withBloomFilterFPP("name", testFpp[i])
+        .build()) {
+        java.util.Iterator<String> iterator = distinctStrings.iterator();
+        while (iterator.hasNext()) {
+          writer.write(factory.newGroup().append("name", iterator.next()));
+        }
+      }
+      distinctStrings.clear();
+
+      try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+        BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+        BloomFilter bloomFilter = reader.getBloomFilterDataReader(blockMetaData)
+          .readBloomFilter(blockMetaData.getColumns().get(0));
+
+        // The exist counts the number of times FindHash returns true.
+        int exist = 0;
+        while (distinctStrings.size() < totalCount) {
+          String str = RandomStringUtils.randomAlphabetic(10);
+          if (distinctStrings.add(str) &&
+            bloomFilter.findHash(LongHashFunction.xx(0).hashBytes(Binary.fromString(str).toByteBuffer()))) {
+            exist++;
+          }
+        }
+        // The exist should be less than totalCount * fpp. Add 10% here for error space.
+        assertTrue(exist < totalCount * (testFpp[i] * 1.1));

Review Comment:
   Any `exist` > 0 is a false positive: a hash value that was never inserted
   into the bloom filter still makes `findHash` return true. I don't think
   there is a simple closed-form calculation of this probability, but setting
   `totalCount` to `100000` seems to be a pretty safe number for the test to
   pass.
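
To make the numbers behind the assertion concrete, here is a minimal, self-contained sketch. It does not use the parquet-mr classes. It assumes the textbook sizing approximation for a filter that sets 8 bits per inserted value, `m = -8 * n / ln(1 - p^(1/8))`, and it assumes the probes are independent with a true false-positive rate equal to the configured FPP. The class and method names are hypothetical, and the real filter may round its size, so its actual rate can differ.

```java
public class BloomFilterFppSketch {

    // Approximate number of bits for n distinct values and target FPP p,
    // assuming 8 bits are set per inserted value (an assumption of this sketch).
    static long approxNumBits(long ndv, double fpp) {
        return (long) Math.ceil(-8.0 * ndv / Math.log(1.0 - Math.pow(fpp, 1.0 / 8.0)));
    }

    // Threshold used by the test assertion: totalCount * fpp plus 10% headroom.
    static double maxAllowedFalsePositives(int probes, double fpp) {
        return probes * fpp * 1.1;
    }

    // How many binomial standard deviations the 10% headroom represents,
    // assuming independent probes whose true rate equals the target FPP.
    static double headroomInSigmas(int probes, double fpp) {
        double headroom = probes * fpp * 0.1;
        double sigma = Math.sqrt(probes * fpp * (1.0 - fpp));
        return headroom / sigma;
    }

    public static void main(String[] args) {
        int totalCount = 100000;
        double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
        for (double fpp : testFpp) {
            System.out.printf("fpp=%.3f bits=%d threshold=%.1f headroomSigmas=%.2f%n",
                fpp,
                approxNumBits(totalCount, fpp),
                maxAllowedFalsePositives(totalCount, fpp),
                headroomInSigmas(totalCount, fpp));
        }
    }
}
```

Under these assumptions the 10% headroom is tightest for the smallest FPP, because the expected count is smallest there. A larger `totalCount` widens the headroom relative to the random variation.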
   
   I am thinking we should probably disallow Bloom filter sizes that are
   unreasonably small. We currently only have a maximum number of bytes for
   the Bloom filter. Shall we also have a minimum number of bytes? What do
   you think? @chenjunjiedada
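
As an illustration of the minimum-size idea, a clamp could look like the sketch below. This is not the parquet-mr implementation. The class name, the method name, and the example bounds are all hypothetical and stand in for whatever limits the project would choose.

```java
public class BloomFilterSizeClampSketch {

    // Clamp a requested filter size into [minBytes, maxBytes].
    // Both bounds are illustrative parameters, not existing configuration keys.
    static int clampBytes(long requestedBytes, int minBytes, int maxBytes) {
        if (minBytes <= 0 || minBytes > maxBytes) {
            throw new IllegalArgumentException("require 0 < minBytes <= maxBytes");
        }
        return (int) Math.max(minBytes, Math.min(requestedBytes, maxBytes));
    }

    public static void main(String[] args) {
        int minBytes = 32;           // hypothetical lower bound
        int maxBytes = 1024 * 1024;  // hypothetical upper bound
        // A tiny request is raised to the lower bound, a huge one is capped.
        System.out.println(clampBytes(1, minBytes, maxBytes));
        System.out.println(clampBytes(5_000_000L, minBytes, maxBytes));
        System.out.println(clampBytes(4096, minBytes, maxBytes));
    }
}
```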
   
   The test takes about 2300 milliseconds on my laptop.
   
   
   
   



##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##########
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws IOException {
     }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+    int totalCount = 100000;
+    double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+    Set<String> distinctStrings = new HashSet<>();
+    while (distinctStrings.size() < totalCount) {
+      String str = RandomStringUtils.randomAlphabetic(12);
+      distinctStrings.add(str);
+    }
+
+    MessageType schema = Types.buildMessage().
+      required(BINARY).as(stringType()).named("name").named("msg");
+
+    Configuration conf = new Configuration();
+    GroupWriteSupport.setSchema(schema, conf);
+
+    GroupFactory factory = new SimpleGroupFactory(schema);
+    for (int i = 0; i < testFpp.length; i++) {
+      File file = temp.newFile();
+      file.delete();
+      Path path = new Path(file.getAbsolutePath());
+      try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+        .withPageRowCountLimit(10)
+        .withConf(conf)
+        .withDictionaryEncoding(false)
+        .withBloomFilterEnabled("name", true)
+        .withBloomFilterNDV("name", totalCount)
+        .withBloomFilterFPP("name", testFpp[i])
+        .build()) {
+        java.util.Iterator<String> iterator = distinctStrings.iterator();
+        while (iterator.hasNext()) {
+          writer.write(factory.newGroup().append("name", iterator.next()));
+        }
+      }
+      distinctStrings.clear();
+
+      try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+        BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+        BloomFilter bloomFilter = reader.getBloomFilterDataReader(blockMetaData)
+          .readBloomFilter(blockMetaData.getColumns().get(0));
+
+        // The exist counts the number of times FindHash returns true.
+        int exist = 0;
+        while (distinctStrings.size() < totalCount) {
+          String str = RandomStringUtils.randomAlphabetic(10);

Review Comment:
   Changed. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

