[ 
https://issues.apache.org/jira/browse/IMPALA-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184916#comment-17184916
 ] 

Tim Armstrong commented on IMPALA-5633:
---------------------------------------

Ok, thanks for confirming. I thought about it a bit and it seems like there are 
probably a lot of reasons why the estimate might be biased high. I'll try to 
find some time to test it out on those TPC-DS queries.

> Bloom filters underestimate false positive probability
> ------------------------------------------------------
>
>                 Key: IMPALA-5633
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5633
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Perf Investigation
>            Reporter: Jim Apple
>            Assignee: Jim Apple
>            Priority: Minor
>
> Block Bloom filters have a higher false positive rate than standard Bloom 
> filters, due to the uneven distribution of keys between buckets. We should 
> change the code to match the theory, using an approximation from the paper 
> that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom 
> Filters" by Putze et al.
> For a false positive probability of 1%, this would increase filter size by 
> about 10% and decrease the filter's false positive probability by 50%. 
> However, this is obscured by the coarse granularity of filter sizes, which 
> are constrained to be a power of two in bytes. Loosening that restriction is 
> potential future work.
> See 
> https://github.com/apache/kudu/commit/d1190c2b06a6eef91b21fd4a0b5fb76534b4e9f9
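
The gap described in the issue can be illustrated numerically. What follows is 
a minimal Python sketch, not Impala's or Kudu's actual code, of the 
Poisson-weighted approximation from Putze et al.: each block receives a 
Poisson-distributed number of keys, and the overall false positive probability 
is the average of the per-block probabilities. The function names and the 
parameter values (10 bits per key, 7 hash functions, 512-bit blocks) are 
assumptions chosen for illustration only.

```python
import math


def standard_fpp(bits_per_key: float, k: int) -> float:
    """Textbook approximation for a standard Bloom filter with k hash functions."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def block_fpp(bits_per_key: float, k: int, block_bits: int) -> float:
    """Approximate FPP of a block Bloom filter (after Putze et al.).

    The number of keys landing in one block is modelled as Poisson with
    mean block_bits / bits_per_key; each block then behaves like a small
    standard Bloom filter of block_bits bits.
    """
    lam = block_bits / bits_per_key  # mean number of keys per block
    # Sum far enough into the Poisson tail that the remainder is negligible.
    upper = int(lam + 12.0 * math.sqrt(lam) + 20.0)
    pmf = math.exp(-lam)  # P(exactly 0 keys in the block)
    total = 0.0
    for i in range(upper + 1):
        if i > 0:
            pmf *= lam / i  # P(i keys) derived from P(i - 1 keys)
        # FPP of a block_bits-bit Bloom filter holding i keys.
        inner = (1.0 - (1.0 - 1.0 / block_bits) ** (i * k)) ** k
        total += pmf * inner
    return total


if __name__ == "__main__":
    # Illustrative parameters (assumptions, not Impala's configuration).
    c, k, block = 10.0, 7, 512
    print("standard:", standard_fpp(c, k))
    print("block   :", block_fpp(c, k, block))
```

Under these assumptions the block estimate comes out above the standard one, 
and the gap widens as the block size shrinks, which is the effect the issue 
says the sizing code does not account for.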



