Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22032 )

Change subject: IMPALA-13086: Lower AggregationNode estimate using stats 
predicate
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/22032/2/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test:

http://gerrit.cloudera.org:8080/#/c/22032/2/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q04.test@138
PS2, Line 138: |  runtime filters: RF000[bloom] <- customer_id, RF001[min_max] 
<- customer_id
> I think perf-AB-test is useful. I'll try it out.
I ran perf-AB-test on a modified branch. It essentially compares the patch 
stack up to this patch against 7369ebb8ba02edfedcef071029b7bcd62157f452 
(IMPALA-13415: Add a special testing mode to track Calcite progress).

The result: almost half of the TPC-DS 10GB scale queries regress by a few 
milliseconds.
https://jenkins.impala.io/job/perf-AB-test-ub2004/165/artifact/Impala/perf_results/latest/performance_result.txt
I noticed that the cardinality estimate before applying the conjuncts (HAVING 
predicate) is good: it is still above the actual cardinality. However, it is 
later hurt by those conjuncts, which apply the default selectivity estimate of 
10%. This can be seen in preaggregation node estimates such as:

52:AGGREGATE in Q11
https://jenkins.impala.io/job/perf-AB-test-ub2004/165/artifact/Impala/perf_results/latest/b308ed48be576a06e7206e26495713a766f09d7c_profiles/TPCDS-Q11_iter007.txt

50:AGGREGATE in Q74
https://jenkins.impala.io/job/perf-AB-test-ub2004/165/artifact/Impala/perf_results/latest/b308ed48be576a06e7206e26495713a766f09d7c_profiles/TPCDS-Q74_iter007.txt
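
A rough illustration of this effect, as a sketch only. The row counts and the 
function name are hypothetical and are not taken from the profiles above; the 
only figure from this comment is the 10% default selectivity.

```python
# Hypothetical sketch: how a default 10% selectivity can turn a safe
# overestimate into an underestimate. All row counts are made up.
DEFAULT_SELECTIVITY = 0.1  # default selectivity mentioned in this comment


def apply_default_selectivity(pre_conjunct_estimate: int) -> int:
    """Scale an estimate by the default selectivity, keeping at least 1 row."""
    return max(1, round(pre_conjunct_estimate * DEFAULT_SELECTIVITY))


actual_rows = 80_000      # hypothetical actual cardinality
pre_conjunct = 100_000    # hypothetical estimate, still above actual
post_conjunct = apply_default_selectivity(pre_conjunct)

# Before the conjunct the estimate is above the actual cardinality;
# after the conjunct it falls well below it.
print(pre_conjunct >= actual_rows, post_conjunct < actual_rows)
```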

There are two options we can take to avoid underestimation here:
- Raise the default selectivity estimate above 10% for HAVING predicates.
- Skip the tuple-based cardinality entirely if the Agg node has a HAVING 
  predicate.

The first option will require some benchmarking to decide on a good value. It 
can still estimate wrongly and surprise users who upgrade to the latest version.

The second option is safer as a temporary workaround until we can sort out other 
issues around Agg node planning, such as IMPALA-13526 and IMPALA-2945. I tested 
the second option and saw 6 queries regress while the rest improved.
https://jenkins.impala.io/job/perf-AB-test-ub2004/166/artifact/Impala/perf_results/latest/performance_result.txt
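
The second option can be sketched roughly as follows. This is not Impala's 
planner code: the function and parameter names are hypothetical, and taking 
the lower of the two estimates when there is no HAVING predicate is an 
assumption made for illustration.

```python
# Hypothetical sketch of the second option; not Impala's planner code.
def choose_agg_estimate(baseline: int, tuple_based: int, has_having: bool) -> int:
    """Pick an output cardinality estimate for an aggregation node."""
    if has_having:
        # Skip the tuple-based estimate entirely when a HAVING predicate exists.
        return baseline
    # Assumption: otherwise the lower of the two estimates is used.
    return min(baseline, tuple_based)


# Made-up numbers: with a HAVING predicate the baseline estimate is kept.
print(choose_agg_estimate(100_000, 10_000, has_having=True))
print(choose_agg_estimate(100_000, 10_000, has_having=False))
```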

I will include that skip in the IMPALA-13465 patch since it has not been 
merged yet.



--
To view, visit http://gerrit.cloudera.org:8080/22032

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia840d68f1c4f126d4e928461ec5c44545dbf25f8
Gerrit-Change-Number: 22032
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 12 Nov 2024 16:18:21 +0000
Gerrit-HasComments: Yes
