[jira] [Commented] (PARQUET-2361) Reduce failure rate of unit test testParquetFileWithBloomFilterWithFpp

ASF GitHub Bot (Jira) Sun, 15 Oct 2023 10:11:04 -0700


    [ https://issues.apache.org/jira/browse/PARQUET-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775395#comment-17775395 ]


ASF GitHub Bot commented on PARQUET-2361:
-----------------------------------------

amousavigourabi commented on PR #1170:
URL: https://github.com/apache/parquet-mr/pull/1170#issuecomment-1763450248

   > > @fengjiajie As the test now evaluates the false positive rate with 
   > > significantly more samples than were used to build the filter (or were 
   > > provided as the NDV), might this not push the bloom filter's false 
   > > positive rate above the FPP? Perhaps we could try increasing the NDV, 
   > > or maybe an adaptive bloom filter would be more appropriate? WDYT?
   > 
   > Never mind, I don't think this should be an issue. I'll run the test 
   > locally in a loop to see if I can reproduce the flake and check how 
   > often it occurs.
   
   I got 4 failures in 10k runs, which would correspond to 4 failures in 1250 
full Actions runs. All of these failures were very slightly outside the 
expected range for the 0.01 FPP case. Given that we take 200k samples, this 
might indicate a flaw in the code: 200k samples of an i.i.d. random variable 
~ Bern(0.01) have an expected 2000 hits and really should not exceed 2200 
hits that often, if ever. If we just want to fix this test, I suggest raising 
the tolerance to 15%. That _should_ keep it from failing.
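
   As a rough check of the numbers above: if each of the 200k lookups were an 
independent Bern(0.01) trial, the hit count would be Binomial(200000, 0.01), 
with mean 2000 and standard deviation of about 44.5. A limit of 2200 hits 
(the 10% tolerance implied by the comment) then sits about 4.5 standard 
deviations above the mean, and 2300 hits (15%) about 6.7. The sketch below 
computes these values and the exact upper-tail probability. The class and 
method names are illustrative only and are not part of parquet-mr, and the 
i.i.d. assumption is exactly the one the comment questions.

```java
class BloomFppTolerance {
    /** Standard deviation of a Binomial(n, p) hit count. */
    static double stdDev(int n, double p) {
        return Math.sqrt(n * p * (1 - p));
    }

    /** Number of standard deviations that `hits` lies above the mean n * p. */
    static double zScore(int n, double p, double hits) {
        return (hits - n * p) / stdDev(n, p);
    }

    /** Exact P(X > k) for X ~ Binomial(n, p), accumulating the pmf in log space. */
    static double upperTail(int n, double p, int k) {
        double logOdds = Math.log(p / (1 - p));
        double logPmf = n * Math.log(1 - p);            // log P(X = 0)
        double tail = k < 0 ? Math.exp(logPmf) : 0.0;   // include X = 0 only if k < 0
        for (int i = 0; i < n; i++) {
            // Step from log P(X = i) to log P(X = i + 1).
            logPmf += Math.log((double) (n - i) / (i + 1)) + logOdds;
            if (i + 1 > k) {
                tail += Math.exp(logPmf);
            }
        }
        return tail;
    }

    public static void main(String[] args) {
        int n = 200_000;   // lookups per test run, as stated in the comment
        double p = 0.01;   // configured FPP for the failing case
        for (double tol : new double[] {0.10, 0.15}) {
            int limit = (int) Math.round(n * p * (1 + tol));
            System.out.printf("tolerance %.0f%%: limit=%d, z=%.2f, P(X > limit)=%.3e%n",
                    tol * 100, limit, zScore(n, p, limit), upperTail(n, p, limit));
        }
    }
}
```

   Under these assumptions, exceeding the 10% limit should be far rarer than 
the observed 4 failures in 10k runs, which is consistent with the suspicion 
that the filter's real false positive rate is slightly above 0.01.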




> Reduce failure rate of unit test testParquetFileWithBloomFilterWithFpp
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-2361
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2361
>             Project: Parquet
>          Issue Type: Test
>          Components: parquet-mr
>    Affects Versions: 1.13.2
>            Reporter: Feng Jiajie
>            Priority: Major
>
> {code:java}
> [INFO] Results:
> [INFO] 
> Error:  Failures: 
> Error:    TestParquetWriter.testParquetFileWithBloomFilterWithFpp:342
> [INFO]  {code}
> The unit test generates its test data from random strings without using a 
> fixed seed. It expects the number of false positives from the Bloom filter 
> to match the configured probability. Therefore, a simple fix is to increase 
> the number of lookups tested against the Bloom filter. A fixed seed is 
> deliberately not used, so that the test is not effective only in specific 
> scenarios. If a fixed seed is necessary, I can modify the PR accordingly.
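
One possible middle ground between a fixed seed and fully unseeded data is to 
pick a fresh seed on every run but log it, so that a failing run can be 
replayed. The sketch below is only an illustration of that idea and is not 
taken from the PR: the class name, the `test.seed` system property, and the 
lowercase string generator are all hypothetical.

```java
import java.util.Random;

class SeededRandomStrings {
    /** Builds a lowercase ASCII string of the given length from the supplied generator. */
    static String randomString(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + rng.nextInt(26)));
        }
        return sb.toString();
    }

    /** Uses -Dtest.seed=<n> if set (hypothetical property name), otherwise a fresh seed. */
    static long chooseSeed() {
        return Long.getLong("test.seed", System.nanoTime());
    }

    public static void main(String[] args) {
        long seed = chooseSeed();
        // Logging the seed makes a flaky run reproducible after the fact.
        System.out.println("random seed = " + seed);
        Random rng = new Random(seed);
        for (int i = 0; i < 3; i++) {
            System.out.println(randomString(rng, 12));
        }
    }
}
```

Re-running with the logged value passed as `-Dtest.seed=<value>` would then 
regenerate the same strings, while ordinary runs still vary their inputs.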



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
