[
https://issues.apache.org/jira/browse/IMPALA-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184916#comment-17184916
]
Tim Armstrong commented on IMPALA-5633:
---------------------------------------
Ok, thanks for confirming. I thought about it a bit and it seems like there are
probably a lot of reasons why the estimate might be biased high. I'll try to
find some time to test it out on those TPC-DS queries.

> Bloom filters underestimate false positive probability
> ------------------------------------------------------
>
> Key: IMPALA-5633
> URL: https://issues.apache.org/jira/browse/IMPALA-5633
> Project: IMPALA
> Issue Type: Bug
> Components: Perf Investigation
> Reporter: Jim Apple
> Assignee: Jim Apple
> Priority: Minor
>
> Block Bloom filters have a higher false positive rate than standard Bloom
> filters, due to the uneven distribution of keys between buckets. We should
> change the code to match the theory, using an approximation from the paper
> that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom
> Filters" by Putze et al.
> For a false positive probability of 1%, this would increase filter size by
> about 10% and decrease the filter's false positive probability by 50%.
> However, this is obscured by the coarse granularity of filter sizes: filters
> are constrained to have a size in bytes that is a power of two. Loosening
> that restriction is potential future work.
> See
> https://github.com/apache/kudu/commit/d1190c2b06a6eef91b21fd4a0b5fb76534b4e9f9
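
The effect described in the issue can be sketched numerically. The snippet below is a rough illustration, not Impala's or Kudu's code: it assumes a bucket of 8 words of 32 bits with one bit set per word for each key (these sizes, and all function names, are assumptions made for the example), and it models the number of keys landing in a bucket as Poisson-distributed, in the spirit of the approximation from Putze et al.

```python
import math


def standard_fpp(bits_per_key, k):
    """Textbook approximation for a standard Bloom filter with k hash functions."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def poisson_pmf(i, mean):
    """Poisson probability of exactly i events, computed in log space."""
    return math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))


def block_fpp(bits_per_key, words=8, word_bits=32, max_keys=2000):
    """Approximate false positive probability of a block Bloom filter.

    Assumed layout (hypothetical, for illustration): each bucket has
    `words` words of `word_bits` bits, and each key sets one bit per word.
    The number of keys per bucket is modelled as Poisson with mean
    bucket_bits / bits_per_key; the sum is truncated at max_keys.
    """
    bucket_bits = words * word_bits
    mean = bucket_bits / bits_per_key
    total = 0.0
    for i in range(max_keys + 1):
        # Probability that a given bit in one word is set after i keys,
        # raised to the number of words a lookup has to check.
        inner = (1.0 - (1.0 - 1.0 / word_bits) ** i) ** words
        total += poisson_pmf(i, mean) * inner
    return total


if __name__ == "__main__":
    for c in (8, 10, 12, 16):
        print(c, standard_fpp(c, 8), block_fpp(c))
```

Under these assumptions, the block estimate comes out above the standard estimate at the same bits per key, because buckets that happen to receive more keys than average contribute disproportionately to the false positive rate.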