Csaba Ringhofer created IMPALA-11928:
----------------------------------------

             Summary: Try to delay runtime filter generation till NDV is known
                 Key: IMPALA-11928
                 URL: https://issues.apache.org/jira/browse/IMPALA-11928
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Csaba Ringhofer


Currently runtime filters are initialized before the build-side hash table is 
built, and they are populated in parallel with the hash table:
https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/exec/partitioned-hash-join-builder-ir.cc#L66
This means that Impala has to rely on planner-time estimates to size the bloom 
filter for the desired FPP.

If the build side fits in memory, it is possible to build the hash table 
first and create the runtime filter by iterating through the keys in the hash 
table. At this point the NDV of the keys can be computed and the bloom filters 
can be given optimal sizes.
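
As a rough illustration of sizing from a known NDV, the sketch below uses the 
textbook bloom filter formula m = -n * ln(p) / (ln 2)^2. It is not Impala's 
sizing code: Impala's block bloom filter may use a different formula, and the 
names OptimalBloomBits and FilterBytesForNdv, the power-of-two rounding and 
the min/max bounds are assumptions made for this example. Here ndv would be 
the number of distinct keys found in the finished hash table.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Textbook bloom filter sizing (an assumption, not Impala's exact formula):
// bits needed for `ndv` distinct keys at false-positive probability `fpp`.
inline double OptimalBloomBits(uint64_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  const double ln2 = std::log(2.0);
  return -static_cast<double>(ndv) * std::log(fpp) / (ln2 * ln2);
}

// Hypothetical helper: smallest power-of-two byte size in
// [min_bytes, max_bytes] that holds the optimal number of bits.
// Both bounds are assumed to be powers of two; the result is capped at
// max_bytes, in which case the target FPP is not reached.
inline uint64_t FilterBytesForNdv(uint64_t ndv, double fpp,
                                  uint64_t min_bytes, uint64_t max_bytes) {
  assert(min_bytes > 0 && min_bytes <= max_bytes);
  const uint64_t needed =
      static_cast<uint64_t>(std::ceil(OptimalBloomBits(ndv, fpp) / 8.0));
  uint64_t size = min_bytes;
  while (size < needed && size < max_bytes) size *= 2;
  return size;
}
```

For example, 1000 distinct keys at 1% FPP need roughly 9585 bits, so with a 
64-byte minimum this sketch would pick a 2048-byte filter.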
Agreeing on the correct size is more complex for shuffled joins, as different 
builders may see different key NDVs, so the builders need to synchronize 
before they start building the bloom filters.
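
One possible shape for that synchronization step, sketched under assumptions: 
each builder reports its local NDV, the values are summed, and every builder 
then sizes its filter from the same total so the per-builder filters can be 
merged. Summing is only an upper bound on the global NDV if the same key can 
show up in more than one builder. AgreedFilterBits is a hypothetical name, 
and the textbook sizing formula stands in for whatever Impala would use.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: given the local NDV reported by every builder,
// return the one filter size (in bits) that all builders should use.
// Uses the textbook formula m = -n * ln(p) / (ln 2)^2 on the summed NDV.
inline uint64_t AgreedFilterBits(const std::vector<uint64_t>& local_ndvs,
                                 double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  // Sum of local NDVs: exact if key sets are disjoint, else an upper bound.
  const uint64_t total =
      std::accumulate(local_ndvs.begin(), local_ndvs.end(), uint64_t{0});
  const double ln2 = std::log(2.0);
  const double bits =
      -static_cast<double>(total) * std::log(fpp) / (ln2 * ln2);
  return static_cast<uint64_t>(std::ceil(bits));
}
```

With three builders reporting 400, 350 and 250 keys, all of them would agree 
on the same size as a single builder that saw 1000 keys.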

If the hash table becomes too large and the builders start to spill, it is 
probably better to fall back to building the bloom filter in parallel with the 
hash table instead of rereading the spilled partitions from disk once all data 
has arrived. It is possible, though, that at this point the NDVs are already 
too large, in which case it is better to disable the filter.
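
The fallback described above could be expressed as a small decision function. 
This is only a sketch: FilterBuildMode, BuilderState, ChooseFilterBuildMode 
and the max_ndv threshold are hypothetical names introduced for illustration, 
not existing Impala code.

```cpp
#include <cstdint>

// Hypothetical: how the runtime filter for one join should be produced.
enum class FilterBuildMode {
  kFromHashTable,      // build side fit in memory: iterate hash table keys
  kParallelWithBuild,  // spilling started: insert keys as rows arrive
  kDisabled            // too many distinct keys for a useful filter
};

// Hypothetical snapshot of a builder at the moment the choice is made.
struct BuilderState {
  bool has_spilled;     // true once any partition has been spilled
  uint64_t ndv_so_far;  // distinct keys observed up to this point
};

inline FilterBuildMode ChooseFilterBuildMode(const BuilderState& state,
                                             uint64_t max_ndv) {
  // No spilling: the delayed, exactly sized filter is possible.
  if (!state.has_spilled) return FilterBuildMode::kFromHashTable;
  // Spilled and the NDV is already too large: give up on the filter.
  if (state.ndv_so_far > max_ndv) return FilterBuildMode::kDisabled;
  // Spilled but still worthwhile: avoid rereading spilled partitions.
  return FilterBuildMode::kParallelWithBuild;
}
```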



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

