[ https://issues.apache.org/jira/browse/PARQUET-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775249#comment-17775249 ]

ASF GitHub Bot commented on PARQUET-2361:
-----------------------------------------

amousavigourabi commented on code in PR #1170:
URL: https://github.com/apache/parquet-mr/pull/1170#discussion_r1359433148


##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##########
@@ -314,32 +315,32 @@ public void testParquetFileWithBloomFilterWithFpp() throws IOException {
         .withConf(conf)
         .withDictionaryEncoding(false)
         .withBloomFilterEnabled("name", true)
-        .withBloomFilterNDV("name", totalCount)
-        .withBloomFilterFPP("name", testFpp[i])
+        .withBloomFilterNDV("name", buildBloomFilterCount)
+        .withBloomFilterFPP("name", testFpp)
         .build()) {
-        java.util.Iterator<String> iterator = distinctStrings.iterator();
-        while (iterator.hasNext()) {
-          writer.write(factory.newGroup().append("name", iterator.next()));
+        for (String str : distinctStringsForFileGenerate) {
+          writer.write(factory.newGroup().append("name", str));
         }
       }
-      distinctStrings.clear();
 
       try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
         BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
         BloomFilter bloomFilter = reader.getBloomFilterDataReader(blockMetaData)
           .readBloomFilter(blockMetaData.getColumns().get(0));
 
-        // The exist counts the number of times FindHash returns true.
-        int exist = 0;
-        while (distinctStrings.size() < totalCount) {
-          String str = RandomStringUtils.randomAlphabetic(randomStrLen - 2);
-          if (distinctStrings.add(str) &&
+        // The false positive counts the number of times FindHash returns true.
+        int falsePositive = 0;
+        Set<String> distinctStringsForProbe = new HashSet<>();
+        while (distinctStringsForProbe.size() < testBloomFilterCount) {
+          String str = RandomStringUtils.randomAlphabetic(randomStrLen - 1);

Review Comment:
   Is there any reason this cannot be `randomStrLen` by the way?


> Reduce failure rate of unit test testParquetFileWithBloomFilterWithFpp
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-2361
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2361
>             Project: Parquet
>          Issue Type: Test
>          Components: parquet-mr
>    Affects Versions: 1.13.2
>            Reporter: Feng Jiajie
>            Priority: Major
>
> {code:java}
> [INFO] Results:
> [INFO]
> Error: Failures:
> Error: TestParquetWriter.testParquetFileWithBloomFilterWithFpp:342
> [INFO] {code}
> The unit test utilizes random string generation for test data without using a
> fixed seed. The expectation of a unit test is that the number of false
> positives in the Bloom filter should match the set probability. Therefore, a
> simple fix is to increase the number of tests on the Bloom filter. The reason
> for not using a fixed seed with random numbers is to avoid making the tests
> effective only in specific scenarios. If it is necessary to use a fixed seed,
> I can also modify the PR accordingly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
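
The issue description argues that probing the Bloom filter more often makes the observed false-positive count track the configured probability more closely, so an unseeded test fails less often. A rough way to see this is to treat each probe as an independent Bernoulli trial and compute the exact binomial tail probability that the count exceeds the test's limit. The sketch below does that. Every number in it is an assumption chosen for illustration and not taken from TestParquetWriter: a true false-positive rate of 0.01, a limit with 20% slack above that rate, and probe counts of 1,000, 10,000 and 100,000. The class and method names are hypothetical as well.

```java
// Illustrative sketch only: names and numbers are hypothetical, not from parquet-mr.
final class FppFlakinessSketch {

  /**
   * Probability that a Binomial(probes, trueFpp) false-positive count is
   * strictly greater than maxAllowed, i.e. that a test asserting
   * "falsePositive <= maxAllowed" fails. Requires 0 < trueFpp < 1.
   * Assumes probes are independent, which is a simplification.
   */
  static double failureProbability(int probes, double trueFpp, int maxAllowed) {
    // log(p / (1 - p)), reused by the pmf recurrence below.
    double logOdds = Math.log(trueFpp) - Math.log1p(-trueFpp);
    // log P(X = 0) = probes * log(1 - p); log space avoids underflow for large probe counts.
    double logPmf = probes * Math.log1p(-trueFpp);
    // If even zero false positives is over the limit, P(X = 0) belongs to the tail.
    double tail = maxAllowed < 0 ? Math.exp(logPmf) : 0.0;
    for (int k = 1; k <= probes; k++) {
      // P(X = k) = P(X = k - 1) * (probes - k + 1) / k * p / (1 - p)
      logPmf += Math.log((double) (probes - k + 1) / k) + logOdds;
      if (k > maxAllowed) {
        tail += Math.exp(logPmf);
      }
    }
    return tail;
  }

  public static void main(String[] args) {
    double trueFpp = 0.01;                       // assumed true false-positive rate
    int[] probeCounts = {1_000, 10_000, 100_000}; // assumed probe counts
    int[] limits = {12, 120, 1_200};             // assumed limit: 1.2 * trueFpp * probes
    for (int i = 0; i < probeCounts.length; i++) {
      System.out.printf("probes=%d limit=%d P(test fails)=%.3e%n",
          probeCounts[i], limits[i],
          failureProbability(probeCounts[i], trueFpp, limits[i]));
    }
  }
}
```

Under these assumptions the failure probability falls steeply as the probe count grows, because the relative spread of the count shrinks roughly as one over the square root of the number of probes. That is consistent with the fix proposed in the issue, although the actual margin depends on the limit the real test asserts and on the filter's true false-positive rate.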