[ 
https://issues.apache.org/jira/browse/IMPALA-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Apple updated IMPALA-5633:
------------------------------
    Description: 
Block Bloom filters have a higher false positive rate than standard Bloom 
filter, due to the uneven distribution of keys between buckets. We should 
change the code to match the theory, using an approximation from the paper that 
introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom 
Filters" by Putze et al.

For a false positive probability of 1%, this would increase filter size by 
about 10% and a decrease filter false positive probability by 50%. However, 
this is obscured by the coarseness of the fact that filters are constrained to 
have a size in bytes that is a power of two. Loosening that restriction is 
potential future work.

See 
https://github.com/apache/kudu/commit/d1190c2b06a6eef91b21fd4a0b5fb76534b4e9f9

  was:
{{bloom-filter.cc}} says:

{noformat}
fpp = (1 - exp(-BUCKET_WORDS * ndv/space))^BUCKET_WORDS

where space is in bits.
{noformat}

This is true only discounting the false positive rate induced by hash 
collisions. Using {{w}}}-bit hashes, with {{n}} distinct values gives a false 
positive rate of

{noformat}
n / exp2(w) + (1.0 - n / exp2(w)) * ((1 - exp(-BUCKET_WORDS * 
ndv/space))^BUCKET_WORDS)
{noformat}

This starts to become a factor as {{n}} approaches {{1 << w}}. It also suggests 
using bitmaps rather than Bloom filters for large {{n}}, since the false 
positive rate for a bitmap is

{noformat}
n / exp2(w) + (1.0 - n / exp2(w)) * (1 - exp(-ndv/space))
{noformat}

This is lower than the current BF false positive rate for high {{n}} and low 
relative space (aka, high false positive probability).

        Summary: Bloom filters underestimate false positive probability  (was: 
Bloom filters underestimate false positive probability for high NDV)

> Bloom filters underestimate false positive probability
> ------------------------------------------------------
>
>                 Key: IMPALA-5633
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5633
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Perf Investigation
>            Reporter: Jim Apple
>            Assignee: Jim Apple
>            Priority: Minor
>
> Block Bloom filters have a higher false positive rate than standard Bloom 
> filter, due to the uneven distribution of keys between buckets. We should 
> change the code to match the theory, using an approximation from the paper 
> that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom 
> Filters" by Putze et al.
> For a false positive probability of 1%, this would increase filter size by 
> about 10% and a decrease filter false positive probability by 50%. However, 
> this is obscured by the coarseness of the fact that filters are constrained 
> to have a size in bytes that is a power of two. Loosening that restriction is 
> potential future work.
> See 
> https://github.com/apache/kudu/commit/d1190c2b06a6eef91b21fd4a0b5fb76534b4e9f9



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to