ggershinsky commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r898002511


##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##########
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws IOException {
     }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+    int totalCount = 100000;
+    double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+    Set<String> distinctStrings = new HashSet<>();
+    while (distinctStrings.size() < totalCount) {
+      String str = RandomStringUtils.randomAlphabetic(12);
+      distinctStrings.add(str);
+    }
+
+    MessageType schema = Types.buildMessage().
+      required(BINARY).as(stringType()).named("name").named("msg");
+
+    Configuration conf = new Configuration();
+    GroupWriteSupport.setSchema(schema, conf);
+
+    GroupFactory factory = new SimpleGroupFactory(schema);
+    for (int i = 0; i < testFpp.length; i++) {
+      File file = temp.newFile();
+      file.delete();
+      Path path = new Path(file.getAbsolutePath());
+      try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+        .withPageRowCountLimit(10)
+        .withConf(conf)
+        .withDictionaryEncoding(false)
+        .withBloomFilterEnabled("name", true)
+        .withBloomFilterNDV("name", totalCount)
+        .withBloomFilterFPP("name", testFpp[i])
+        .build()) {
+        for (String str : distinctStrings) {
+          writer.write(factory.newGroup().append("name", str));
+        }
+      }
+      distinctStrings.clear();
+
+      try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+        BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+        BloomFilter bloomFilter = reader.getBloomFilterDataReader(blockMetaData)
+          .readBloomFilter(blockMetaData.getColumns().get(0));
+
+        // exist counts the probes for which findHash returns true. The probe
+        // strings (length 10) were never written (only length-12 strings were),
+        // so every hit is a false positive.
+        int exist = 0;
+        while (distinctStrings.size() < totalCount) {
+          String str = RandomStringUtils.randomAlphabetic(10);
+          if (distinctStrings.add(str) &&
+              bloomFilter.findHash(LongHashFunction.xx(0).hashBytes(Binary.fromString(str).toByteBuffer()))) {
+            exist++;
+          }
+        }
+        // exist should be less than totalCount * fpp; allow a 10% margin for random variation.
+        assertTrue(exist < totalCount * (testFpp[i] * 1.1));

Review Comment:
   Two related questions:
   - What should `totalCount` be to reliably ensure that (a) `exist > 0` and (b) `exist < totalCount * (testFpp[i] * 1.1)`? Depending on the `fpp` value, we can get a random assertion failure if `totalCount` is too low (and `exist` could then simply be 0). If `totalCount` is high, the unit test could take a very long time.
   - How long does this unit test run on your laptop (with the current `totalCount` of 100000)?
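
   To make the flakiness question concrete: if the filter achieves its nominal `fpp`, the false-positive count is roughly Binomial(`totalCount`, `fpp`), so a normal approximation gives a quick feel for how safe the 10% margin is at a given `totalCount`. The sketch below is illustrative only (the class and method names are not part of the PR); it computes how many standard deviations the assertion threshold sits above the expected count:

   ```java
   public class FppMarginEstimate {
     // For X ~ Binomial(n, fpp), mean = n*fpp and sigma = sqrt(n*fpp*(1-fpp)).
     // The test asserts exist < 1.1 * n * fpp, so the margin above the mean is
     // 0.1 * n * fpp; this returns that margin in units of sigma (a z-score).
     // z >= ~3 means the assertion fails with probability well under 0.2%.
     static double zScoreOfMargin(int n, double fpp) {
       double mean = n * fpp;
       double sigma = Math.sqrt(n * fpp * (1 - fpp));
       return 0.1 * mean / sigma;
     }

     public static void main(String[] args) {
       int n = 100_000;
       for (double fpp : new double[] {0.005, 0.01, 0.05, 0.10, 0.25}) {
         System.out.printf("fpp=%.3f  z=%.2f%n", n, zScoreOfMargin(n, fpp));
       }
     }
   }
   ```

   At `totalCount = 100000` the z-score grows with `fpp`, so the smallest `fpp` values (0.005, 0.01) are the ones where the 10% margin is tightest and a random assertion failure is most plausible.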



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
