[ 
https://issues.apache.org/jira/browse/PARQUET-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775253#comment-17775253
 ] 

ASF GitHub Bot commented on PARQUET-2361:
-----------------------------------------

fengjiajie commented on code in PR #1170:
URL: https://github.com/apache/parquet-mr/pull/1170#discussion_r1359435099


##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##########
@@ -314,32 +315,32 @@ public void testParquetFileWithBloomFilterWithFpp() throws IOException {
         .withConf(conf)
         .withDictionaryEncoding(false)
         .withBloomFilterEnabled("name", true)
-        .withBloomFilterNDV("name", totalCount)
-        .withBloomFilterFPP("name", testFpp[i])
+        .withBloomFilterNDV("name", buildBloomFilterCount)
+        .withBloomFilterFPP("name", testFpp)
         .build()) {
-        java.util.Iterator<String> iterator = distinctStrings.iterator();
-        while (iterator.hasNext()) {
-          writer.write(factory.newGroup().append("name", iterator.next()));
+        for (String str : distinctStringsForFileGenerate) {
+          writer.write(factory.newGroup().append("name", str));
         }
       }
-      distinctStrings.clear();
 
       try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
         BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
        BloomFilter bloomFilter = reader.getBloomFilterDataReader(blockMetaData)
           .readBloomFilter(blockMetaData.getColumns().get(0));
 
-        // The exist counts the number of times FindHash returns true.
-        int exist = 0;
-        while (distinctStrings.size() < totalCount) {
-          String str = RandomStringUtils.randomAlphabetic(randomStrLen - 2);
-          if (distinctStrings.add(str) &&
+        // The false positive counts the number of times FindHash returns true.
+        int falsePositive = 0;
+        Set<String> distinctStringsForProbe = new HashSet<>();
+        while (distinctStringsForProbe.size() < testBloomFilterCount) {
+          String str = RandomStringUtils.randomAlphabetic(randomStrLen - 1);

Review Comment:
   @amousavigourabi The purpose of this unit test is to verify that the false 
positive rate of the bloom filter meets the configured expectation. The current 
approach builds the bloom filter from many strings of length 12, so any string 
whose length is not 12 is guaranteed to be absent from the original data. For 
such strings (length != 12), a positive answer from the bloom filter is by 
definition a false positive, and the false positive rate is computed from those 
cases. Alternatively, if we wanted to probe with strings of length 12, we would 
have to randomly generate strings and check whether each one exists in the 
original data; only the ones that do not exist can be used to measure the 
false positive rate.
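The probing strategy described above can be sketched with a toy Bloom filter. This is a minimal illustration, not Parquet's BlockSplitBloomFilter: the class name, the FNV-based double hashing, and the fixed seed are all assumptions made for a self-contained example. The filter is built from distinct length-12 strings and probed with length-11 strings, so every `mightContain` hit is a false positive by construction.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Toy Bloom filter used only to illustrate the probing strategy; the real
// test exercises Parquet's BlockSplitBloomFilter, not this class.
public class FppProbe {
    final BitSet bits;
    final int m, k;

    FppProbe(int expectedNdv, double fpp) {
        // Standard sizing formulas: m = -n*ln(p)/ln(2)^2, k = (m/n)*ln(2).
        m = (int) Math.ceil(-expectedNdv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
        k = Math.max(1, (int) Math.round((double) m / expectedNdv * Math.log(2)));
        bits = new BitSet(m);
    }

    // Seeded FNV-1a hash; two seeds give the two hashes for double hashing.
    static long fnv(String s, long seed) {
        long h = 0xcbf29ce484222325L ^ seed;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    void add(String s) {
        long h1 = fnv(s, 0), h2 = fnv(s, 1) | 1L;
        for (int i = 0; i < k; i++) bits.set((int) Math.floorMod(h1 + i * h2, (long) m));
    }

    boolean mightContain(String s) {
        long h1 = fnv(s, 0), h2 = fnv(s, 1) | 1L;
        for (int i = 0; i < k; i++)
            if (!bits.get((int) Math.floorMod(h1 + i * h2, (long) m))) return false;
        return true;
    }

    static String randomAlphabetic(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) sb.append((char) ('a' + rnd.nextInt(26)));
        return sb.toString();
    }

    // Build the filter from `ndv` distinct length-12 strings, then probe with
    // length-11 strings: none were inserted, so every hit is a false positive.
    public static double observedFpp(int ndv, double fpp, int probes, long seed) {
        Random rnd = new Random(seed); // fixed seed only for this sketch
        FppProbe filter = new FppProbe(ndv, fpp);
        Set<String> inserted = new HashSet<>();
        while (inserted.size() < ndv) {
            String s = randomAlphabetic(rnd, 12);
            if (inserted.add(s)) filter.add(s);
        }
        int falsePositive = 0;
        for (int i = 0; i < probes; i++) {
            if (filter.mightContain(randomAlphabetic(rnd, 11))) falsePositive++;
        }
        return (double) falsePositive / probes;
    }

    public static void main(String[] args) {
        System.out.println("observed fpp = " + observedFpp(5000, 0.05, 20000, 42L));
    }
}
```

With a target fpp of 0.05 the observed rate should land near 5%; the disjoint string lengths are what make the count unambiguous, exactly as in the test being discussed.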





> Reduce failure rate of unit test testParquetFileWithBloomFilterWithFpp
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-2361
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2361
>             Project: Parquet
>          Issue Type: Test
>          Components: parquet-mr
>    Affects Versions: 1.13.2
>            Reporter: Feng Jiajie
>            Priority: Major
>
> {code:java}
> [INFO] Results:
> [INFO] 
> Error:  Failures: 
> Error:    TestParquetWriter.testParquetFileWithBloomFilterWithFpp:342
> [INFO]  {code}
> The unit test generates its data from random strings without a fixed seed. 
> The test expects the number of false positives observed in the Bloom filter 
> to match the configured probability, so a simple fix is to increase the 
> number of probes against the Bloom filter, which reduces the variance of the 
> observed rate. A fixed seed is deliberately avoided so that the test does 
> not only pass for one specific input; if a fixed seed is required, I can 
> modify the PR accordingly.
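The variance argument in the description above can be sketched numerically. Over N independent probes, each a false positive with probability p, the observed rate has relative standard deviation sqrt((1-p)/(N*p)), so more probes make the measured rate concentrate around the configured fpp. The probe counts below are illustrative, not the constants used in the PR.

```java
// Sketch of why more probes stabilize the measured false-positive rate.
public class FlakinessSketch {
    // Relative standard deviation of the observed FP rate over `probes`
    // independent lookups, each a false positive with probability `fpp`:
    // stddev(count/probes) / fpp = sqrt((1 - fpp) / (probes * fpp)).
    static double relStdDev(int probes, double fpp) {
        return Math.sqrt((1 - fpp) / (probes * fpp));
    }

    public static void main(String[] args) {
        // With fpp = 0.01, 1000 probes leave roughly 31% relative noise,
        // while 100000 probes cut it to roughly 3%, so a fixed tolerance
        // band around the expected count fails far less often.
        System.out.printf("1e3 probes: %.3f%n", relStdDev(1_000, 0.01));
        System.out.printf("1e5 probes: %.3f%n", relStdDev(100_000, 0.01));
    }
}
```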



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
