Riza Suminto has uploaded this change for review. ( http://gerrit.cloudera.org:8080/22047
Change subject: IMPALA-2945: Account for duplicate keys on multiple nodes preAgg ...................................................................... IMPALA-2945: Account for duplicate keys on multiple nodes preAgg AggregationNode.computeStats() estimate cardinality under single node assumption. This can be an underestimation in preaggregation node case because same grouping key may exist in multiple nodes during preaggreation. This patch adjust the cardinality estimate using following model for the number of distinct values in a random sample of k rows. Assumes we are picking k rows from an infinite sample with ndv distinct values, with the value uniformly distributed. The probability of a given value not appearing in a sample, in that case is ((NDV - 1) / NDV) ^ k This is because we are making k choices, and each of them has (ndv - 1) / ndv chance of not being our value. Therefore the probability of a given value appearing in the sample is: 1 - ((NDV - 1) / NDV) ^ k And the number of distinct values in the sample is: (1 - ((NDV - 1) / NDV) ^ k) * NDV This adjustment is done during distributed planning and assume the minimum number of preaggregation instances per host. It intentionally does not use getNumInstances() because it might change during parallel planning phase or cost based planning phase. Therefore, preaggregation node cardinality may still underestimate in case of MT_DOP > 1 and COMPUTE_PROCESSING_COST = true. Testing: - Pass core tests. Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55 --- M fe/src/main/java/org/apache/impala/planner/AggregationNode.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/bin/restore-stats-on-planner-tests.py M testdata/workloads/functional-planner/queries/PlannerTest/agg-node-high-mem-estimate.test M testdata/workloads/functional-planner/queries/PlannerTest/agg-node-low-mem-estimate.test M testdata/workloads/functional-planner/queries/PlannerTest/agg-node-max-mem-estimate.test M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test M testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test M testdata/workloads/functional-planner/queries/PlannerTest/card-agg.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables-hash-join.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables-resources.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test M testdata/workloads/functional-planner/queries/PlannerTest/joins.test M testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test M testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-materialization.test M testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct.test M testdata/workloads/functional-planner/queries/PlannerTest/outer-joins.test M testdata/workloads/functional-planner/queries/PlannerTest/partition-key-scans-default.test M testdata/workloads/functional-planner/queries/PlannerTest/preagg-bytes-limit.test M testdata/workloads/functional-planner/queries/PlannerTest/processing-cost-plan-admission-slots.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/shuffle-by-distinct-exprs.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-processing-cost.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q02.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q04.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q05.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q06.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q07.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q08.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q09.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q11.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q12.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q13.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q14a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q14b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q15.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q18.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q20.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q21.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q22.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q23a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q23b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q24a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q24b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q26.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q27.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q28.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q31.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q32.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q33.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q36.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q37.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q38.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q39a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q39b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q40.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q42.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q43.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q44.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q45.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q48.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q49.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q50.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q59.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q60.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q61.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q62.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q64.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q65.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q66.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q69.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q70.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q74.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q76.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q77.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q80.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q82.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q86.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q87.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q88.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q90.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q92.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q96.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q97.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q98.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q99.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/ddl.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-ddl-iceberg.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-ddl-parquet.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q01.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q02.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q05.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q06.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q07.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q08.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q09.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q10a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q11.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q12.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q13.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q14a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q14b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q15.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q18.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q19.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q20.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q21.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q22.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q23b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q24a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q24b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q26.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q27.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q28.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q30.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q32.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q33.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q35a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q36.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q37.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q38.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q39a.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q39b.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q40.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q42.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q43-verbose.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q43.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q44.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q45.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q47.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q48.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q49.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q50.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q52.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q53.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q54.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q55.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q56.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q57.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q58.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q59.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q60.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q61.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q62.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q63.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q64.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q65.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q66.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q67.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q69.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q70.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q72.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q74.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q76.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q77.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q80.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q81.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q82.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q83.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q85.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q86.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q87.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q88.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q90.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q91.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q92.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q93.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q96.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q97.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q98.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q99.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test 179 files changed, 4,061 insertions(+), 3,928 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/22047/1 -- To view, visit http://gerrit.cloudera.org:8080/22047 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55 Gerrit-Change-Number: 22047 Gerrit-PatchSet: 1 Gerrit-Owner: Riza Suminto <[email protected]>
