Re: [PR] PARQUET-2361: Reduce failure rate of unit test [parquet-mr]

via GitHub Sat, 14 Oct 2023 08:17:52 -0700


fengjiajie commented on code in PR #1170:
URL: https://github.com/apache/parquet-mr/pull/1170#discussion_r1359435099



##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##########
@@ -314,32 +315,32 @@ public void testParquetFileWithBloomFilterWithFpp() throws IOException {
         .withConf(conf)
         .withDictionaryEncoding(false)
         .withBloomFilterEnabled("name", true)
-        .withBloomFilterNDV("name", totalCount)
-        .withBloomFilterFPP("name", testFpp[i])
+        .withBloomFilterNDV("name", buildBloomFilterCount)
+        .withBloomFilterFPP("name", testFpp)
         .build()) {
-        java.util.Iterator<String> iterator = distinctStrings.iterator();
-        while (iterator.hasNext()) {
-          writer.write(factory.newGroup().append("name", iterator.next()));
+        for (String str : distinctStringsForFileGenerate) {
+          writer.write(factory.newGroup().append("name", str));
         }
       }
-      distinctStrings.clear();
 
       try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
         BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
         BloomFilter bloomFilter = reader.getBloomFilterDataReader(blockMetaData)
           .readBloomFilter(blockMetaData.getColumns().get(0));
 
-        // The exist counts the number of times FindHash returns true.
-        int exist = 0;
-        while (distinctStrings.size() < totalCount) {
-          String str = RandomStringUtils.randomAlphabetic(randomStrLen - 2);
-          if (distinctStrings.add(str) &&
+        // The false positive counts the number of times FindHash returns true.
+        int falsePositive = 0;
+        Set<String> distinctStringsForProbe = new HashSet<>();
+        while (distinctStringsForProbe.size() < testBloomFilterCount) {
+          String str = RandomStringUtils.randomAlphabetic(randomStrLen - 1);

Review Comment:
   @amousavigourabi  The purpose of this unit test is to verify that the bloom 
filter's false positive rate meets expectations. The current approach builds 
the bloom filter from many strings of length 12, so any probe string whose 
length is not 12 is guaranteed to be absent from the original data. If the 
bloom filter returns true for such a probe, that is a false positive, and the 
false positive rate is the fraction of probes for which this happens. 
Alternatively, if we want to probe with strings of length 12, we have to 
generate random strings and check each one against the original data; only the 
strings that are absent can be used to measure the false positive rate.
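
   To illustrate the two probing strategies described above, here is a minimal,
self-contained sketch. It is not the code in this PR: `ToyBloomFilter` is a
hypothetical stand-in built on `java.util.BitSet` (it is not Parquet's bloom
filter), and `randomAlphabetic`, `measureFalsePositiveRate`, and the sizes used
in `main` are assumptions chosen only for illustration. When the probe length
differs from the inserted length, the membership check never fires; when the
lengths match, it discards probes that are really present.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class FalsePositiveProbeSketch {

  /** Toy bloom filter used only for this sketch; NOT Parquet's implementation. */
  public static final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public ToyBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    public void insert(String value) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(index(value, i));
      }
    }

    public boolean mightContain(String value) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(index(value, i))) {
          return false; // definitely absent
        }
      }
      return true; // present, or a false positive
    }

    // Double hashing: derive the i-th bit position from two mixed hashes.
    private int index(String value, int i) {
      int h1 = mix(value.hashCode());
      int h2 = mix(h1 ^ 0x9e3779b9) | 1;
      return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit finalizer in the style of MurmurHash3's fmix32.
    private static int mix(int h) {
      h ^= h >>> 16;
      h *= 0x85ebca6b;
      h ^= h >>> 13;
      h *= 0xc2b2ae35;
      h ^= h >>> 16;
      return h;
    }
  }

  /** Hypothetical helper: random string of ASCII letters with the given length. */
  public static String randomAlphabetic(Random rnd, int len) {
    String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) {
      sb.append(alphabet.charAt(rnd.nextInt(alphabet.length())));
    }
    return sb.toString();
  }

  /** Probes the filter with distinct strings known to be absent from {@code inserted}. */
  public static double measureFalsePositiveRate(
      ToyBloomFilter filter, Set<String> inserted, int probeLen, int probeCount, Random rnd) {
    Set<String> probes = new HashSet<>();
    int falsePositive = 0;
    while (probes.size() < probeCount) {
      String str = randomAlphabetic(rnd, probeLen);
      if (inserted.contains(str)) {
        continue; // only reachable when probeLen equals the inserted length
      }
      if (probes.add(str) && filter.mightContain(str)) {
        falsePositive++;
      }
    }
    return (double) falsePositive / probeCount;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    int insertedLen = 12;
    Set<String> inserted = new HashSet<>();
    while (inserted.size() < 5000) {
      inserted.add(randomAlphabetic(rnd, insertedLen));
    }
    ToyBloomFilter filter = new ToyBloomFilter(80_000, 7);
    for (String s : inserted) {
      filter.insert(s);
    }
    // Strategy 1: a different length guarantees the probe is absent.
    double differentLen = measureFalsePositiveRate(filter, inserted, insertedLen - 1, 10_000, rnd);
    // Strategy 2: same length, so each probe is checked against the original data.
    double sameLen = measureFalsePositiveRate(filter, inserted, insertedLen, 10_000, rnd);
    System.out.println("FPP with different-length probes: " + differentLen);
    System.out.println("FPP with same-length probes:      " + sameLen);
  }
}
```

Both strategies should estimate the same quantity. The different-length variant
avoids the membership lookup, while the same-length variant keeps probes drawn
from the same distribution as the inserted data.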



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
