[
https://issues.apache.org/jira/browse/PARQUET-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775249#comment-17775249
]
ASF GitHub Bot commented on PARQUET-2361:
-----------------------------------------
amousavigourabi commented on code in PR #1170:
URL: https://github.com/apache/parquet-mr/pull/1170#discussion_r1359433148
##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##########
@@ -314,32 +315,32 @@ public void testParquetFileWithBloomFilterWithFpp()
throws IOException {
.withConf(conf)
.withDictionaryEncoding(false)
.withBloomFilterEnabled("name", true)
- .withBloomFilterNDV("name", totalCount)
- .withBloomFilterFPP("name", testFpp[i])
+ .withBloomFilterNDV("name", buildBloomFilterCount)
+ .withBloomFilterFPP("name", testFpp)
.build()) {
- java.util.Iterator<String> iterator = distinctStrings.iterator();
- while (iterator.hasNext()) {
- writer.write(factory.newGroup().append("name", iterator.next()));
+ for (String str : distinctStringsForFileGenerate) {
+ writer.write(factory.newGroup().append("name", str));
}
}
- distinctStrings.clear();
try (ParquetFileReader reader =
ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
BloomFilter bloomFilter =
reader.getBloomFilterDataReader(blockMetaData)
.readBloomFilter(blockMetaData.getColumns().get(0));
- // The exist counts the number of times FindHash returns true.
- int exist = 0;
- while (distinctStrings.size() < totalCount) {
- String str = RandomStringUtils.randomAlphabetic(randomStrLen - 2);
- if (distinctStrings.add(str) &&
+ // The false positive counts the number of times FindHash returns true.
+ int falsePositive = 0;
+ Set<String> distinctStringsForProbe = new HashSet<>();
+ while (distinctStringsForProbe.size() < testBloomFilterCount) {
+ String str = RandomStringUtils.randomAlphabetic(randomStrLen - 1);
Review Comment:
Is there any reason this cannot be `randomStrLen` by the way?
> Reduce failure rate of unit test testParquetFileWithBloomFilterWithFpp
> ----------------------------------------------------------------------
>
> Key: PARQUET-2361
> URL: https://issues.apache.org/jira/browse/PARQUET-2361
> Project: Parquet
> Issue Type: Test
> Components: parquet-mr
> Affects Versions: 1.13.2
> Reporter: Feng Jiajie
> Priority: Major
>
> {code:java}
> [INFO] Results:
> [INFO]
> Error: Failures:
> Error: TestParquetWriter.testParquetFileWithBloomFilterWithFpp:342
> [INFO] {code}
> The unit test utilizes random string generation for test data without using a
> fixed seed. The expectation of a unit test is that the number of false
> positives in the Bloom filter should match the set probability. Therefore, a
> simple fix is to increase the number of tests on the Bloom filter. The reason
> for not using a fixed seed with random numbers is to avoid making the tests
> effective only in specific scenarios. If it is necessary to use a fixed seed,
> I can also modify the PR accordingly.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)