Joe McDonnell created IMPALA-14594:
--------------------------------------
Summary: Grouping pre-agg's estimated input cardinality is not
scaled by the number of threads
Key: IMPALA-14594
URL: https://issues.apache.org/jira/browse/IMPALA-14594
Project: IMPALA
Issue Type: Bug
Components: Backend
Affects Versions: Impala 5.0.0
Reporter: Joe McDonnell
Grouping pre-aggs decide whether to keep expanding based on whether the
aggregation is achieving reduction. They use two measurements. The current
reduction is (# input rows - # rows returned passthrough) / # hash table entries.
The predicted total reduction is given by this formula (taken from
[be/src/exec/grouping-aggregator.cc|https://github.com/apache/impala/blob/4be5fd8896dcd445a6379bdcda4bdcf318f24511/be/src/exec/grouping-aggregator.cc#L376-L380]):
Extrapolate the current reduction factor (r) using the formula R = 1 + (N/n) *
(r - 1), where R is the reduction factor over the full input data set, N is the
number of input rows, excluding passed-through rows, and n is the number of
rows inserted or merged into the hash tables. This is a very rough
approximation but is good enough to be useful.
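For concreteness, here is a minimal C++ sketch of the two measurements (the function and variable names are illustrative, not the actual members of grouping-aggregator.cc):
{code:cpp}
#include <cstdint>

// Observed (current) reduction: aggregated input rows per hash table entry.
// Passed-through rows are excluded because they were never aggregated.
double CurrentReduction(int64_t num_input_rows, int64_t num_passthrough_rows,
                        int64_t num_ht_entries) {
  if (num_ht_entries == 0) return 1.0;
  return static_cast<double>(num_input_rows - num_passthrough_rows) / num_ht_entries;
}

// Predicted total reduction R = 1 + (N/n) * (r - 1), where r is the current
// reduction, N is the estimated number of input rows excluding passed-through
// rows, and n is the number of rows inserted or merged into the hash tables.
double PredictedTotalReduction(double r, int64_t N, int64_t n) {
  if (n == 0) return 1.0;
  return 1.0 + (static_cast<double>(N) / n) * (r - 1.0);
}
{code}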
The predicted total reduction formula is sensitive to the estimated input
cardinality (the N above). A fragment instance only processes a fraction of the
input rows, so the estimated input cardinality should be scaled down to the
number of input rows this fragment instance is expected to process (i.e. the
total value should be divided by the number of fragment instances). That
scaling is currently not happening, so the estimated reduction is significantly
higher than it should be, causing pre-aggs to expand more aggressively than
they should.
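Here is a sketch of the kind of per-instance scaling the description calls for, using a hypothetical helper (the actual fix would apply wherever the grouping aggregator reads the planner's input cardinality estimate):
{code:cpp}
#include <cstdint>

// Hypothetical helper: scale the planner's per-node input cardinality estimate
// down to the share a single fragment instance is expected to process, before
// it is used as N in R = 1 + (N/n) * (r - 1).
int64_t PerInstanceInputCardinality(int64_t estimated_input_cardinality,
                                    int32_t num_fragment_instances) {
  if (estimated_input_cardinality < 0) return -1;  // unknown stays unknown
  if (num_fragment_instances < 1) num_fragment_instances = 1;
  return estimated_input_cardinality / num_fragment_instances;
}
{code}
Because R - 1 scales linearly with N, leaving N at the per-node total inflates the predicted reduction roughly by the number of fragment instances; e.g. with 20 instances, (R - 1) comes out ~20x larger than it should for each instance.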