[
https://issues.apache.org/jira/browse/IMPALA-13086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Riza Suminto resolved IMPALA-13086.
-----------------------------------
Fix Version/s: Impala 4.5.0
Target Version: Impala 4.5.0
Resolution: Fixed
> Cardinality estimate of AggregationNode should consider predicates on
> group-by columns
> --------------------------------------------------------------------------------------
>
> Key: IMPALA-13086
> URL: https://issues.apache.org/jira/browse/IMPALA-13086
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Quanlong Huang
> Assignee: Riza Suminto
> Priority: Critical
> Fix For: Impala 4.5.0
>
> Attachments: plan.txt
>
>
> Consider the following tables:
> {code:sql}
> CREATE EXTERNAL TABLE t1(
> t1_id bigint,
> t5_id bigint,
> t5_name string,
> register_date string
> ) stored as textfile;
> CREATE EXTERNAL TABLE t2(
> t1_id bigint,
> t3_id bigint,
> pay_time timestamp,
> refund_time timestamp,
> state_code int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t3(
> t3_id bigint,
> t3_name string,
> class_id int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t5(
> id bigint,
> t5_id bigint,
> t5_name string,
> branch_id bigint,
> branch_name string
> ) stored as textfile;
> alter table t1 set tblproperties('numRows'='6031170829');
> alter table t1 set column stats t1_id ('numDVs'='8131016','numNulls'='0');
> alter table t1 set column stats t5_id ('numDVs'='389','numNulls'='0');
> alter table t1 set column stats t5_name
> ('numDVs'='523','numNulls'='85928157','maxsize'='27','avgSize'='17.79120063781738');
> alter table t1 set column stats register_date
> ('numDVs'='9283','numNulls'='0','maxsize'='8','avgSize'='8');
> alter table t2 set tblproperties('numRows'='864341085');
> alter table t2 set column stats t1_id ('numDVs'='1007302','numNulls'='0');
> alter table t2 set column stats t3_id ('numDVs'='5013','numNulls'='2800503');
> alter table t2 set column stats pay_time ('numDVs'='1372020','numNulls'='0');
> alter table t2 set column stats refund_time
> ('numDVs'='251658','numNulls'='791645118');
> alter table t2 set column stats state_code ('numDVs'='8','numNulls'='0');
> alter table t3 set tblproperties('numRows'='4452');
> alter table t3 set column stats t3_id ('numDVs'='4452','numNulls'='0');
> alter table t3 set column stats t3_name
> ('numDVs'='4452','numNulls'='0','maxsize'='176','avgSize'='37.60469818115234');
> alter table t3 set column stats class_id ('numDVs'='75','numNulls'='0');
> alter table t5 set tblproperties('numRows'='2177245');
> alter table t5 set column stats t5_id ('numDVs'='826','numNulls'='0');
> alter table t5 set column stats t5_name
> ('numDVs'='523','numNulls'='0','maxsize'='67','avgSize'='19.12560081481934');
> alter table t5 set column stats branch_id ('numDVs'='53','numNulls'='0');
> alter table t5 set column stats branch_name
> ('numDVs'='55','numNulls'='0','maxsize'='61','avgSize'='16.05229949951172');
> {code}
> Put a data file to each table to make the stats valid
> {code:bash}
> echo '2024' > data.txt
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t1
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t2
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t3
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t5
> {code}
> REFRESH these tables after adding the data files.
> The cardinality of AggregationNodes are overestimated in the following query:
> {code:sql}
> explain select
> register_date,
> t4.t5_id,
> t5.t5_name,
> t5.branch_name,
> count(distinct t1_id),
> count(distinct case when diff_day=0 then t1_id else null end ),
> count(distinct case when diff_day<=3 then t1_id else null end ),
> count(distinct case when diff_day<=7 then t1_id else null end ),
> count(distinct case when diff_day<=14 then t1_id else null end ),
> count(distinct case when diff_day<=30 then t1_id else null end ),
> count(distinct case when diff_day<=60 then t1_id else null end ),
> count(distinct case when pay_time is not null then t1_id else null end )
> from (
> select t1.t1_id,t1.register_date,t1.t5_id,t2.pay_time,t2.t3_id,t3.t3_name,
> datediff(pay_time,register_date) diff_day
> from (
> select t1_id,pay_time,t3_id from t2
> where state_code = 0 and pay_time>=trunc(NOW(),'Y')
> and cast(pay_time as date) <> cast(refund_time as date)
> )t2
> join t3 on t2.t3_id=t3.t3_id
> right join t1 on t1.t1_id=t2.t1_id
> )t4
> left join t5 on t4.t5_id=t5.t5_id
> where register_date='20230515'
> group by register_date,t4.t5_id,t5.t5_name,t5.branch_name;{code}
> One of the AggregationNode:
> {noformat}
> 17:AGGREGATE [FINALIZE]
> | Class 0
> | output: count:merge(t1_id)
> | group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> | Class 1
> | output: count:merge(CASE WHEN diff_day = 0 THEN t1_id ELSE NULL END)
> | group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> | Class 2
> | output: count:merge(CASE WHEN diff_day <= 3 THEN t1_id ELSE NULL END)
> | group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> | Class 3
> | output: count:merge(CASE WHEN diff_day <= 7 THEN t1_id ELSE NULL END)
> | group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> | Class 4
> | output: count:merge(CASE WHEN diff_day <= 14 THEN t1_id ELSE NULL END)
> | group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> | Class 5
> | output: count:merge(CASE WHEN diff_day <= 30 THEN t1_id ELSE NULL END)
> | group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> | Class 6
> | output: count:merge(CASE WHEN diff_day <= 60 THEN t1_id ELSE NULL END)
> | group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> | Class 7
> | output: count:merge(CASE WHEN pay_time IS NOT NULL THEN t1_id ELSE NULL
> END)
> | group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> | mem-estimate=73.07TB mem-reservation=272.00MB thread-reservation=0
> | tuple-ids=7N,9N,11N,13N,15N,17N,19N,21N row-size=761B cardinality=830.98G
> | in pipelines: 17(GETNEXT), 15(OPEN){noformat}
> The cardinality is 830.98B which is the multiplied NDVs of the group by
> columns and times the 8 aggregation classes: 9283 * 389 * 523 * 55 * 8.
> (Thank [~boroknagyz] for digging into this!)
> This doesn't respect the predicates on the group by columns. Predicate
> register_date='20230515' should reduce the NDV of 'register_date' to 1.
> CC [~rizaon]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)