Hello Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/22047
to look at the new patch set (#5).
Change subject: IMPALA-2945: Account for duplicate keys on multiple nodes preAgg
......................................................................
IMPALA-2945: Account for duplicate keys on multiple nodes preAgg
AggregationNode.computeStats() estimates cardinality under a single-node
assumption. This can underestimate the cardinality of a preaggregation
node, because the same grouping key may exist on multiple nodes during
preaggregation.
This patch adjusts the cardinality estimate using the following model for
the number of distinct values in a random sample of k rows. The model
assumes we are picking k rows from an infinite sample with NDV distinct
values, with the values uniformly distributed.
In that case, the probability of a given value not appearing in the
sample is:
((NDV - 1) / NDV) ^ k
This is because we are making k independent choices, and each of them has
a (NDV - 1) / NDV chance of not being our value. Therefore, the
probability of a given value appearing in the sample is:
1 - ((NDV - 1) / NDV) ^ k
And the expected number of distinct values in the sample is:
(1 - ((NDV - 1) / NDV) ^ k) * NDV
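The model above can be sketched in a few lines of Python. This is only an
illustration of the formula, not the code in AggregationNode.java. The
function names, the even split of input rows across instances, and the cap
at the input row count are all assumptions made for this example.

```python
def expected_distinct_in_sample(ndv: float, k: float) -> float:
    """Expected number of distinct values seen in k uniform draws
    from a domain of ndv distinct values (the model described above)."""
    if ndv <= 0 or k <= 0:
        return 0.0
    # Probability that one particular value never appears in k draws.
    p_absent = ((ndv - 1.0) / ndv) ** k
    return (1.0 - p_absent) * ndv


def preagg_cardinality(ndv: float, input_rows: float, num_instances: int) -> float:
    """Hypothetical helper: assume input rows are split evenly across
    num_instances preaggregation instances, apply the model to each
    instance, and sum the results. The sum is capped at input_rows
    because a preaggregation cannot emit more rows than it reads."""
    if num_instances <= 0:
        return 0.0
    rows_per_instance = input_rows / num_instances
    per_instance = expected_distinct_in_sample(ndv, rows_per_instance)
    return min(input_rows, num_instances * per_instance)
```

For example, with NDV = 100 and k = 100 the model gives roughly 63.4
distinct values in one sample. Under the even-split assumption, three
instances reading 100 rows each would together emit about 190 rows, which
is more than the single-node estimate of 100.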
This adjustment is done during distributed planning and assumes the
minimum number of preaggregation instances per host. It intentionally
does not use getNumInstances() because that value might change during the
parallel planning phase or the cost-based planning phase. Therefore, the
preaggregation node cardinality may still be underestimated when
MT_DOP > 1 and COMPUTE_PROCESSING_COST = true.
Testing:
- Pass core tests.
Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
---
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/bin/restore-stats-on-planner-tests.py
M testdata/workloads/functional-planner/queries/PlannerTest/agg-node-high-mem-estimate.test
M testdata/workloads/functional-planner/queries/PlannerTest/agg-node-low-mem-estimate.test
M testdata/workloads/functional-planner/queries/PlannerTest/agg-node-max-mem-estimate.test
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables-hash-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables-resources.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-materialization.test
M testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct.test
M testdata/workloads/functional-planner/queries/PlannerTest/outer-joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/partition-key-scans-default.test
M testdata/workloads/functional-planner/queries/PlannerTest/preagg-bytes-limit.test
M testdata/workloads/functional-planner/queries/PlannerTest/processing-cost-plan-admission-slots.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/shuffle-by-distinct-exprs.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-processing-cost.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q02.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q04.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q05.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q06.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q07.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q08.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q09.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q11.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q12.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q13.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q14a.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q14b.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q15.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q18.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q20.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q21.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q22.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q23a.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q23b.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q24a.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q24b.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q26.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q27.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q28.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q31.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q32.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q33.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q36.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q37.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q38.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q39a.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q39b.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q40.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q42.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q43.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q44.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q45.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q48.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q50.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q59.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q60.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q61.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q62.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q64.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q65.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q66.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q69.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q70.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q74.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q76.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q77.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q80.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q82.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q86.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q87.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q88.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q90.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q92.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q96.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q97.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q98.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q99.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/ddl.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-ddl-iceberg.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-ddl-parquet.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q01.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q02.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q05.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q06.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q07.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q08.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q09.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q10a.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q11.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q12.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q13.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q14a.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q14b.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q15.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q18.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q19.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q20.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q21.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q22.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23a.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23b.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q24a.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q24b.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q26.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q27.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q28.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q30.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q32.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q33.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q35a.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q36.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q37.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q38.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q40.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q42.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q43-verbose.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q43.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q44.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q45.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q47.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q48.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q49.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q50.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q52.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q53.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q54.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q55.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q56.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q57.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q58.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q59.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q60.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q61.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q62.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q63.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q64.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q65.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q66.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q67.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q69.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q70.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q72.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q74.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q76.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q77.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q80.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q81.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q82.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q83.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q85.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q86.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q87.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q88.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q90.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q91.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q92.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q93.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q96.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q97.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q98.test
M
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q99.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
176 files changed, 3,833 insertions(+), 3,703 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/22047/5
--
To view, visit http://gerrit.cloudera.org:8080/22047
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Gerrit-Change-Number: 22047
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>