[
https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767630#comment-17767630
]
Csaba Ringhofer commented on IMPALA-12455:
------------------------------------------
[~rizaon] I assumed that on the first run we would use the existing min/max
values for the total size of the filter, not for the per-partition filter. This
should consume the same amount of memory on consumers and less memory on
producers. I don't think the fpp (false positive probability) should change
significantly because of this: writing smaller but disjunct filters should give
a similar fpp to writing larger ones and union-ing them.
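
A rough way to check the fpp argument above, as a sketch only: it uses the
textbook approximation for a standard Bloom filter rather than Impala's actual
filter implementation, and it assumes keys are spread evenly across
partitions. The function and parameter names are made up for illustration.

```python
import math


def bloom_fpp(num_keys: float, num_bits: float, num_hashes: int) -> float:
    """Textbook approximation (1 - e^(-k*n/m))^k for a standard Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def partitioned_fpp(num_keys: float, num_bits: float, num_hashes: int,
                    num_partitions: int) -> float:
    """fpp when the same total bit budget is split into disjunct sub-filters.

    Assumption: keys are distributed evenly, so each sub-filter holds
    num_keys / num_partitions keys in num_bits / num_partitions bits.
    A probe consults exactly one sub-filter, so the overall fpp is that
    sub-filter's fpp.
    """
    return bloom_fpp(num_keys / num_partitions,
                     num_bits / num_partitions,
                     num_hashes)
```

Because the keys-to-bits ratio is unchanged, both functions return the same
value under these assumptions. With skewed partitions the per-partition fpp
would differ between sub-filters.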
> Create set of disjunct bloom filters for keys in partitioned builds
> -------------------------------------------------------------------
>
> Key: IMPALA-12455
> URL: https://issues.apache.org/jira/browse/IMPALA-12455
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend, Frontend
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: bloom-filter, performance, runtime-filters
>
> Currently Impala aggregates bloom filters from different instances of the
> join builder by OR-ing them to a final filter. This could be avoided by
> having num_instances smaller bloom filters and choosing the correct one
> during lookup by doing the same hashing as used in partitioning. Builders
> would only need to write a single small filter as they have only keys from a
> single partition. This would make runtime filter producers faster and much
> more scalable, and it shouldn't have a major effect on consumers.
> One caveat is that we push down the current bloom filter to Kudu as is, so
> this optimization wouldn't be applicable to filters consumed by Kudu scans.
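
The idea in the description can be sketched roughly as follows. This is an
illustration only, not Impala code: the class names, the blake2b-based hash,
and the Python-int bitset are all assumptions chosen to keep the example
self-contained. The one property it is meant to show is that lookup picks the
sub-filter with the same hash that partitioning used.

```python
import hashlib


def _hash(key: bytes, seed: int) -> int:
    # Stand-in hash function (assumption); seed selects an independent hash.
    digest = hashlib.blake2b(key, digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def partition_of(key: bytes, num_instances: int) -> int:
    # Seed 0 is reserved here for the partitioning hash.
    return _hash(key, 0) % num_instances


class BloomFilter:
    """Minimal standard Bloom filter backed by a Python int used as a bitset."""

    def __init__(self, num_bits: int, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: bytes):
        # Seeds start at 1 so they never reuse the partitioning hash.
        return [_hash(key, seed + 1) % self.num_bits
                for seed in range(self.num_hashes)]

    def insert(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class PartitionedBloomFilter:
    """num_instances disjunct sub-filters sharing one total bit budget."""

    def __init__(self, num_instances: int, total_bits: int):
        self.num_instances = num_instances
        self.filters = [BloomFilter(total_bits // num_instances)
                        for _ in range(num_instances)]

    def builder_insert(self, instance: int, key: bytes) -> None:
        # A builder only ever writes its own small filter.
        self.filters[instance].insert(key)

    def may_contain(self, key: bytes) -> bool:
        # The consumer repeats the partitioning hash to pick one sub-filter.
        return self.filters[partition_of(key, self.num_instances)].may_contain(key)
```

In this sketch no OR-ing step is needed: the final filter is just the list of
per-instance filters. Each builder instance would receive only the keys for
which partition_of(key, num_instances) equals its own index.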