Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/22032 )
Change subject: IMPALA-13086: Lower AggregationNode estimate using stats predicate
......................................................................

Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/22032/2/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test
File testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test:

http://gerrit.cloudera.org:8080/#/c/22032/2/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test@138
PS2, Line 138: |  runtime filters: RF000[bloom] <- customer_id, RF001[min_max] <- customer_id
> I think perf-AB-test is useful. I'll try it out.
I ran perf-AB-test on a modified branch. It essentially compares the patch stack up to this patch against 7369ebb8ba02edfedcef071029b7bcd62157f452 (IMPALA-13415: Add a special testing mode to track Calcite progress).

As a result, almost half of the TPC-DS 10GB scale queries regress by a few milliseconds.
https://jenkins.impala.io/job/perf-AB-test-ub2004/165/artifact/Impala/perf_results/latest/performance_result.txt

I notice that the cardinality estimate before applying the conjuncts (HAVING predicates) is good: it is still above the actual cardinality. However, it is later hurt by those conjuncts, which apply the default selectivity estimate of 10%. This can be seen in preaggregation node estimates such as:

52:AGGREGATE in Q11
https://jenkins.impala.io/job/perf-AB-test-ub2004/165/artifact/Impala/perf_results/latest/b308ed48be576a06e7206e26495713a766f09d7c_profiles/TPCDS-Q11_iter007.txt

50:AGGREGATE in Q74
https://jenkins.impala.io/job/perf-AB-test-ub2004/165/artifact/Impala/perf_results/latest/b308ed48be576a06e7206e26495713a766f09d7c_profiles/TPCDS-Q74_iter007.txt

There are two solutions we can take to avoid underestimation here:
- Raise the default selectivity estimate above 10% for HAVING predicates.
- Skip tuple-based cardinality entirely if the Agg node has a HAVING predicate.
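
To make the underestimation concrete, here is a toy sketch of how a 10% default selectivity interacts with an already lowered estimate, and how the second option avoids it. This is not Impala's planner code: the function name, the parameters, the fallback to an NDV-based estimate, and all numbers are assumptions chosen only for illustration.

```python
# Toy model only. Names, the fallback estimate, and the numbers are
# hypothetical; they are not taken from Impala's AggregationNode.
DEFAULT_SELECTIVITY = 0.10  # the 10% default mentioned in the comment


def estimate_agg_cardinality(ndv_based_card, tuple_based_card,
                             has_having, skip_tuple_based_if_having):
    """Return a simplified output cardinality estimate for an Agg node."""
    if has_having and skip_tuple_based_if_having:
        # Option 2: ignore the tighter tuple-based estimate when a HAVING
        # predicate will scale the result down anyway.
        base = ndv_based_card
    else:
        # Otherwise take the lower of the two estimates.
        base = min(ndv_based_card, tuple_based_card)
    if has_having:
        # A HAVING conjunct with unknown selectivity gets the default.
        base = base * DEFAULT_SELECTIVITY
    return max(1, round(base))


# Hypothetical inputs: a loose NDV-based estimate and a tighter
# tuple-based estimate that is already close to the true cardinality.
lowered = estimate_agg_cardinality(1_000_000, 120_000, True, False)
skipped = estimate_agg_cardinality(1_000_000, 120_000, True, True)
print(lowered, skipped)
```

In this sketch the tighter estimate is cut by another factor of ten, which is the kind of underestimation described above, while skipping the tuple-based estimate leaves more headroom.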
The first option will require some benchmarking to decide on a good value. It can still estimate wrongly and surprise users who upgrade to the latest version. The second option is safer as a temporary workaround until we can sort out other issues around Agg node planning, such as IMPALA-13526 and IMPALA-2945.

I tested the second option and saw 6 queries regress while the rest improved.
https://jenkins.impala.io/job/perf-AB-test-ub2004/166/artifact/Impala/perf_results/latest/performance_result.txt

I will include that skip in the IMPALA-13465 patch since it has not been merged yet.

--
To view, visit http://gerrit.cloudera.org:8080/22032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia840d68f1c4f126d4e928461ec5c44545dbf25f8
Gerrit-Change-Number: 22032
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 12 Nov 2024 16:18:21 +0000
Gerrit-HasComments: Yes
