fengjiajie commented on PR #1170: URL: https://github.com/apache/parquet-mr/pull/1170#issuecomment-1763633100
> > > @fengjiajie As the test is now evaluating the false positive rate with significantly more samples than what we use to build the filter, or are provided as NDV, might it not be the case that this will increase the false positive rate of the bloom filter to more than the FPP? Perhaps we could try increasing the NDV, or maybe an adaptive bloom filter might be more appropriate? WDYT?
> >
> > Nevermind, I don't think this should be an issue. I'll be running the test locally in a loop to see if I can reproduce the flake and check how often it might occur.
>
> I got 4 failures in 10k runs. This would mean 4 failures in 1250 full actions runs. These failures were all very slightly out of the expected range for the 0.01 fpp case. Given that we take 200k samples, this might indicate a flaw in the code, as 200k samples of some i.i.d. random variable ~ Bern(0.01) really should not have over 2200 hits that often, if ever. If we want to just fix this test I suggest raising the tolerance to 15%. That _should_ keep it from failing.

@amousavigourabi I agree with increasing fault tolerance. Thank you very much for your review and testing.
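
For reference, the statistical argument quoted above can be checked with a small sketch. It assumes the idealized model from the comment: 200k independent Bern(0.01) trials, so the hit count is Binomial(200000, 0.01) with mean 2000 and standard deviation sqrt(1980), about 44.5. The thresholds 2200 and 2300 are taken as the upper bounds implied by 10% and 15% tolerances; that reading of the test's range check, and the class and method names, are assumptions for illustration and not code from the PR.

```java
class FppTailCheck {

    /**
     * P(X > threshold) for X ~ Binomial(n, p).
     * Uses the recurrence P(k+1) = P(k) * (n-k)/(k+1) * p/(1-p) in log space,
     * because P(X = 0) = (1-p)^n underflows a double for n = 200k.
     */
    static double upperTail(int n, double p, int threshold) {
        double logPmf = n * Math.log1p(-p);               // log P(X = 0)
        double logOdds = Math.log(p) - Math.log1p(-p);    // log(p / (1 - p))
        double tail = 0.0;
        for (int k = 0; k < n; k++) {
            // Advance from log P(X = k) to log P(X = k + 1).
            logPmf += Math.log((double) (n - k) / (k + 1)) + logOdds;
            if (k + 1 > threshold) {
                tail += Math.exp(logPmf);
            }
        }
        return tail;
    }

    public static void main(String[] args) {
        int n = 200_000;      // samples per test case, per the comment
        double p = 0.01;      // target false positive probability

        double mean = n * p;
        double sd = Math.sqrt(n * p * (1 - p));
        System.out.printf("mean = %.1f, sd = %.2f%n", mean, sd);

        // Assumed upper bounds for a 10% and a 15% tolerance around the mean.
        double tail10 = upperTail(n, p, 2200);
        double tail15 = upperTail(n, p, 2300);
        System.out.printf("P(hits > 2200) = %.3e%n", tail10);
        System.out.printf("P(hits > 2300) = %.3e%n", tail15);

        // Expected number of upper-bound failures in 10k runs under the ideal model.
        System.out.printf("expected failures in 10k runs at 10%%: %.4f%n", 10_000 * tail10);
    }
}
```

Under this model, 2200 hits sits roughly 4.5 standard deviations above the mean, so 4 failures in 10k runs would be far more than chance alone predicts. That is consistent with the suspicion that the filter's real false positive rate is slightly above 0.01, in which case a wider tolerance hides the flake rather than explaining it.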