Csaba Ringhofer created IMPALA-11928:
----------------------------------------
Summary: Try to delay runtime filter generation till NDV is known
Key: IMPALA-11928
URL: https://issues.apache.org/jira/browse/IMPALA-11928
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Csaba Ringhofer
Currently runtime filters are initialized before the build-side hash table
is built, and are populated in parallel with the hash table:
https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/exec/partitioned-hash-join-builder-ir.cc#L66
This means that Impala has to rely on planner-time estimates to size the
bloom filter for the desired false positive probability (FPP).
If the build side fits in memory, it is possible to build the hash table
first and create the runtime filter by iterating through the keys in the hash
table. At this point the NDV of the keys can be computed and the bloom filters
can be given optimal sizes.
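
To illustrate the sizing step, here is a minimal sketch that derives a bloom
filter size from an exact key NDV and a target FPP. It uses the textbook
formula m = -n * ln(p) / (ln 2)^2 for a classic bloom filter, which is an
assumption for illustration and not necessarily the formula Impala's own
filter implementation uses. The names OptimalBloomBits, RoundUpToPowerOfTwo
and CountDistinctKeys are hypothetical, and the std::unordered_set merely
stands in for the set of keys already held by the hash table.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Textbook bloom filter sizing: number of bits needed to hold `ndv` distinct
// keys at false positive probability `fpp` (assumption: classic bloom filter
// with an optimal number of hash functions).
inline uint64_t OptimalBloomBits(uint64_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  if (ndv == 0) return 0;
  const double ln2 = std::log(2.0);
  const double bits = -static_cast<double>(ndv) * std::log(fpp) / (ln2 * ln2);
  return static_cast<uint64_t>(std::ceil(bits));
}

// Rounds up to the next power of two (returns 1 for inputs 0 and 1).
inline uint64_t RoundUpToPowerOfTwo(uint64_t v) {
  uint64_t p = 1;
  while (p < v) p <<= 1;
  return p;
}

// Hypothetical stand-in for "iterate through the keys in the hash table":
// counts the distinct values among the build-side keys.
inline uint64_t CountDistinctKeys(const std::vector<int64_t>& keys) {
  return std::unordered_set<int64_t>(keys.begin(), keys.end()).size();
}
```

With 1000 distinct keys and a 1% FPP this asks for 9586 bits, or 16384 bits
when rounded up to a power of two.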
Agreeing on the correct size is more complex for shuffled joins, as different
builders may see different key NDVs, so the builders need to synchronize
before starting to build the bloom filters.
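
One possible shape for that synchronization step, as a rough sketch only: each
builder reports its local NDV, and a common size is derived from the total.
This assumes that the shuffle sends each key to exactly one builder, so the
local NDVs add up to the global NDV, and that all builders must use the same
filter size so their filters can be merged. AgreeOnFilterBits and
FilterBitsForNdv are hypothetical names, and the sizing formula is the
textbook one rather than anything taken from Impala.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

// Textbook bloom filter sizing, rounded up to a power of two (assumption:
// classic bloom filter, not Impala's actual implementation).
inline uint64_t FilterBitsForNdv(uint64_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  if (ndv == 0) return 0;
  const double ln2 = std::log(2.0);
  const double bits =
      std::ceil(-static_cast<double>(ndv) * std::log(fpp) / (ln2 * ln2));
  uint64_t size = 1;
  while (static_cast<double>(size) < bits) size <<= 1;
  return size;
}

// Hypothetical synchronization result: given the local NDV reported by every
// builder, returns the single filter size that all builders should use.
// Assumes disjoint key sets per builder, so the global NDV is the sum.
inline uint64_t AgreeOnFilterBits(const std::vector<uint64_t>& local_ndvs,
                                  double fpp) {
  const uint64_t global_ndv =
      std::accumulate(local_ndvs.begin(), local_ndvs.end(), uint64_t{0});
  return FilterBitsForNdv(global_ndv, fpp);
}
```

How the local NDVs are exchanged (for example through the coordinator) is
left open here; the sketch only covers the arithmetic once they are known.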
If the hash table becomes too large and the builders start to spill, it is
probably better to fall back to building the bloom filter in parallel with the
hash table instead of rereading the spilled partitions from disk once all data
has arrived. It is possible though that at this point the NDVs are already too
large, in which case it is better to disable the filter.
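
The fallback described above could be expressed as a small decision function.
This is only a sketch of the policy: FilterStrategy, BuilderState,
ChooseFilterStrategy and the max_useful_ndv threshold are all hypothetical
names, and how the threshold would be chosen is not covered by this issue.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcomes for a single runtime filter.
enum class FilterStrategy {
  kBuildFromHashTable,  // build side fit in memory: size the filter by NDV
  kBuildInParallel,     // spilling started: insert keys as rows arrive
  kDisable              // too many distinct keys: filter is not worthwhile
};

// Hypothetical snapshot of a builder at the moment the decision is made.
struct BuilderState {
  bool has_spilled;     // true once any partition was spilled to disk
  uint64_t ndv_so_far;  // distinct keys observed up to this point
};

// Picks a strategy following the policy in the text: prefer the hash table
// pass, fall back to parallel building on spill, disable on very high NDV.
inline FilterStrategy ChooseFilterStrategy(const BuilderState& state,
                                           uint64_t max_useful_ndv) {
  assert(max_useful_ndv > 0);
  if (!state.has_spilled) return FilterStrategy::kBuildFromHashTable;
  if (state.ndv_so_far > max_useful_ndv) return FilterStrategy::kDisable;
  return FilterStrategy::kBuildInParallel;
}
```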