[
https://issues.apache.org/jira/browse/IMPALA-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184908#comment-17184908
]
Jim Apple commented on IMPALA-5633:
-----------------------------------
I think it's just a mistake. IIRC, [~mmokhtar] and I had a brief discussion
about it a few years ago and he thought it was an error as well.
At the time, I wrote a patch to change it to 1% and ran some perf tests, which
were inconclusive, likely due to me not picking judiciously which tests would
demonstrate a difference, so the patch went in my backlog and has yet to escape.
I'd be delighted to +2 a patch that changed it and demonstrated the perf
impact. It might be good to combine it with a patch to allow non-power-of-two
sizes, which can be done without a modulo via
{code:c}
uint64_t libfilter_block_index(uint64_t hash, uint64_t num_buckets) {
return (((unsigned __int128)hash) * ((unsigned __int128)num_buckets)) >> 64;
}
{code}
> Bloom filters underestimate false positive probability
> ------------------------------------------------------
>
> Key: IMPALA-5633
> URL: https://issues.apache.org/jira/browse/IMPALA-5633
> Project: IMPALA
> Issue Type: Bug
> Components: Perf Investigation
> Reporter: Jim Apple
> Assignee: Jim Apple
> Priority: Minor
>
> Block Bloom filters have a higher false positive rate than standard Bloom
> filter, due to the uneven distribution of keys between buckets. We should
> change the code to match the theory, using an approximation from the paper
> that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom
> Filters" by Putze et al.
> For a false positive probability of 1%, this would increase filter size by
> about 10% and a decrease filter false positive probability by 50%. However,
> this is obscured by the coarseness of the fact that filters are constrained
> to have a size in bytes that is a power of two. Loosening that restriction is
> potential future work.
> See
> https://github.com/apache/kudu/commit/d1190c2b06a6eef91b21fd4a0b5fb76534b4e9f9
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]