fengjiajie commented on code in PR #1170:
URL: https://github.com/apache/parquet-mr/pull/1170#discussion_r1359435099
##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##########
@@ -314,32 +315,32 @@ public void testParquetFileWithBloomFilterWithFpp()
throws IOException {
.withConf(conf)
.withDictionaryEncoding(false)
.withBloomFilterEnabled("name", true)
- .withBloomFilterNDV("name", totalCount)
- .withBloomFilterFPP("name", testFpp[i])
+ .withBloomFilterNDV("name", buildBloomFilterCount)
+ .withBloomFilterFPP("name", testFpp)
.build()) {
- java.util.Iterator<String> iterator = distinctStrings.iterator();
- while (iterator.hasNext()) {
- writer.write(factory.newGroup().append("name", iterator.next()));
+ for (String str : distinctStringsForFileGenerate) {
+ writer.write(factory.newGroup().append("name", str));
}
}
- distinctStrings.clear();
try (ParquetFileReader reader =
ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
BloomFilter bloomFilter =
reader.getBloomFilterDataReader(blockMetaData)
.readBloomFilter(blockMetaData.getColumns().get(0));
- // The exist counts the number of times FindHash returns true.
- int exist = 0;
- while (distinctStrings.size() < totalCount) {
- String str = RandomStringUtils.randomAlphabetic(randomStrLen - 2);
- if (distinctStrings.add(str) &&
+ // The false positive counts the number of times FindHash returns true.
+ int falsePositive = 0;
+ Set<String> distinctStringsForProbe = new HashSet<>();
+ while (distinctStringsForProbe.size() < testBloomFilterCount) {
+ String str = RandomStringUtils.randomAlphabetic(randomStrLen - 1);
Review Comment:
@amousavigourabi The purpose of this unit test is to verify that the false
positive rate of the bloom filter meets expectations. The current approach
builds the bloom filter from many strings of length 12, so any string whose
length is not 12 is guaranteed to be absent from the original data. If the
bloom filter returns true for such a string (length != 12), that result is a
false positive, and the false positive rate is the fraction of these probes
that return true. Alternatively, if we want to probe with strings of length
12, we need to generate random strings and check whether each one exists in
the original data. Only the strings that do not exist there can be used to
measure the false positive rate.
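
To make the two options concrete, here is a minimal, self-contained sketch. It is not the test in this PR and it does not use Parquet's `BloomFilter`: `ToyBloomFilter`, `distinctStrings`, `fppWithOtherLength` and `fppWithSameLength` are hypothetical names, and the filter is a simple `BitSet` stand-in chosen only so the example runs without dependencies.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class FppProbeSketch {
  /** Toy bloom filter (assumption: a stand-in, NOT Parquet's implementation). */
  public static final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public ToyBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    // FNV-1a over the chars, followed by a 64-bit avalanche step.
    private static long hash64(String s) {
      long h = 0xcbf29ce484222325L;
      for (int i = 0; i < s.length(); i++) {
        h ^= s.charAt(i);
        h *= 0x100000001b3L;
      }
      h ^= h >>> 33;
      h *= 0xff51afd7ed558ccdL;
      h ^= h >>> 33;
      return h;
    }

    // Double hashing: the i-th bit position is derived from two 32-bit halves.
    private int index(long h, int i) {
      int h1 = (int) h;
      int h2 = (int) (h >>> 32) | 1;
      return Math.floorMod(h1 + i * h2, numBits);
    }

    public void insert(String s) {
      long h = hash64(s);
      for (int i = 0; i < numHashes; i++) bits.set(index(h, i));
    }

    public boolean mightContain(String s) {
      long h = hash64(s);
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(index(h, i))) return false;
      }
      return true;
    }
  }

  /** Generates `count` distinct random alphabetic strings of length `len`. */
  public static Set<String> distinctStrings(Random rnd, int count, int len) {
    String letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    Set<String> out = new HashSet<>();
    while (out.size() < count) {
      StringBuilder sb = new StringBuilder(len);
      for (int i = 0; i < len; i++) sb.append(letters.charAt(rnd.nextInt(letters.length())));
      out.add(sb.toString());
    }
    return out;
  }

  /** Option 1: probe with a length that was never inserted, so every hit is a false positive. */
  public static double fppWithOtherLength(ToyBloomFilter f, Random rnd, int probes, int probeLen) {
    int falsePositive = 0;
    for (String s : distinctStrings(rnd, probes, probeLen)) {
      if (f.mightContain(s)) falsePositive++;
    }
    return (double) falsePositive / probes;
  }

  /** Option 2: probe with the inserted length, skipping strings that really were inserted. */
  public static double fppWithSameLength(
      ToyBloomFilter f, Set<String> inserted, Random rnd, int probes, int len) {
    Set<String> probed = new HashSet<>();
    int falsePositive = 0;
    while (probed.size() < probes) {
      String s = distinctStrings(rnd, 1, len).iterator().next();
      if (inserted.contains(s) || !probed.add(s)) continue; // present in the data or already probed
      if (f.mightContain(s)) falsePositive++;
    }
    return (double) falsePositive / probes;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    Set<String> inserted = distinctStrings(rnd, 10_000, 12);
    ToyBloomFilter filter = new ToyBloomFilter(100_000, 7);
    for (String s : inserted) filter.insert(s);
    System.out.println("other length: " + fppWithOtherLength(filter, rnd, 10_000, 11));
    System.out.println("same length:  " + fppWithSameLength(filter, inserted, rnd, 10_000, 12));
  }
}
```

Both options should estimate the same rate. The first is simpler because it needs no membership check against the original data, which is why the current approach uses it.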