Joe McDonnell created IMPALA-14594:
--------------------------------------

             Summary: Grouping pre-agg's estimated input cardinality is not 
scaled by the number of threads
                 Key: IMPALA-14594
                 URL: https://issues.apache.org/jira/browse/IMPALA-14594
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
    Affects Versions: Impala 5.0.0
            Reporter: Joe McDonnell


Grouping pre-aggs decide whether to keep expanding based on whether the 
aggregation is achieving reduction. It has two measurements: Current reduction 
is (# input rows - # rows returned passthrough) / # hash table entries. 
Predicted total reduction is given by this formula (taken from 
[be/src/exec/grouping-aggregator.cc|https://github.com/apache/impala/blob/4be5fd8896dcd445a6379bdcda4bdcf318f24511/be/src/exec/grouping-aggregator.cc#L376-L380]):

Extrapolate the current reduction factor (r) using the formula R = 1 + (N/n) * 
(r - 1), where R is the reduction factor over the full input data set, N is the 
number of input rows, excluding passed-through rows, and n is the number of 
rows inserted or merged into the hash tables. This is a very rough 
approximation but is good enough to be useful.

The predicted total reduction formula is sensitive to the estimated input 
cardinality. A fragment instance is only going to be processing a fraction of 
the input rows, so the estimated input cardinality should be scaled down to the 
number of input rows expected to be processed by this fragment instance (i.e. 
the total value should be divided by the number of fragment instances). That is 
currently not happening, so the estimated reduction is significantly higher 
than it should be. This causes pre-aggs to expand more aggressively.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to