[ 
https://issues.apache.org/jira/browse/IMPALA-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184908#comment-17184908
 ] 

Jim Apple commented on IMPALA-5633:
-----------------------------------

I think it's just a mistake. IIRC, [~mmokhtar] and I had a brief discussion 
about it a few years ago and he thought it was an error as well.

At the time, I wrote a patch to change it to 1% and ran some perf tests, which 
were inconclusive, likely due to me not picking judiciously which tests would 
demonstrate a difference, so the patch went in my backlog and has yet to escape.

I'd be delighted to +2 a patch that changed it and demonstrated the perf 
impact. It might be good to combine it with a patch to allow non-power-of-two 
sizes, which can be done without a modulo via

{code:c}
uint64_t libfilter_block_index(uint64_t hash, uint64_t num_buckets) {
  return (((unsigned __int128)hash) * ((unsigned __int128)num_buckets)) >> 64;
}
{code}


> Bloom filters underestimate false positive probability
> ------------------------------------------------------
>
>                 Key: IMPALA-5633
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5633
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Perf Investigation
>            Reporter: Jim Apple
>            Assignee: Jim Apple
>            Priority: Minor
>
> Block Bloom filters have a higher false positive rate than standard Bloom 
> filter, due to the uneven distribution of keys between buckets. We should 
> change the code to match the theory, using an approximation from the paper 
> that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom 
> Filters" by Putze et al.
> For a false positive probability of 1%, this would increase filter size by 
> about 10% and a decrease filter false positive probability by 50%. However, 
> this is obscured by the coarseness of the fact that filters are constrained 
> to have a size in bytes that is a power of two. Loosening that restriction is 
> potential future work.
> See 
> https://github.com/apache/kudu/commit/d1190c2b06a6eef91b21fd4a0b5fb76534b4e9f9



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to