Quanlong Huang created IMPALA-13086:
---------------------------------------
Summary: Cardinality estimate of AggregationNode should consider
predicates on group-by columns
Key: IMPALA-13086
URL: https://issues.apache.org/jira/browse/IMPALA-13086
Project: IMPALA
Issue Type: Bug
Components: Frontend
Reporter: Quanlong Huang
Consider the following tables:
{code:sql}
CREATE EXTERNAL TABLE t1(
t1_id bigint,
t5_id bigint,
t5_name string,
register_date string
) stored as textfile;
CREATE EXTERNAL TABLE t2(
t1_id bigint,
t3_id bigint,
pay_time timestamp,
refund_time timestamp,
state_code int
) stored as textfile;
CREATE EXTERNAL TABLE t3(
t3_id bigint,
t3_name string,
class_id int
) stored as textfile;
CREATE EXTERNAL TABLE t5(
id bigint,
t5_id bigint,
t5_name string,
branch_id bigint,
branch_name string
) stored as textfile;
alter table t1 set tblproperties('numRows'='6031170829');
alter table t1 set column stats t1_id ('numDVs'='8131016','numNulls'='0');
alter table t1 set column stats t5_id ('numDVs'='389','numNulls'='0');
alter table t1 set column stats t5_name
('numDVs'='523','numNulls'='85928157','maxsize'='27','avgSize'='17.79120063781738');
alter table t1 set column stats register_date
('numDVs'='9283','numNulls'='0','maxsize'='8','avgSize'='8');
alter table t2 set tblproperties('numRows'='864341085');
alter table t2 set column stats t1_id ('numDVs'='1007302','numNulls'='0');
alter table t2 set column stats t3_id ('numDVs'='5013','numNulls'='2800503');
alter table t2 set column stats pay_time ('numDVs'='1372020','numNulls'='0');
alter table t2 set column stats refund_time
('numDVs'='251658','numNulls'='791645118');
alter table t2 set column stats state_code ('numDVs'='8','numNulls'='0');
alter table t3 set tblproperties('numRows'='4452');
alter table t3 set column stats t3_id ('numDVs'='4452','numNulls'='0');
alter table t3 set column stats t3_name
('numDVs'='4452','numNulls'='0','maxsize'='176','avgSize'='37.60469818115234');
alter table t3 set column stats class_id ('numDVs'='75','numNulls'='0');
alter table t5 set tblproperties('numRows'='2177245');
alter table t5 set column stats t5_id ('numDVs'='826','numNulls'='0');
alter table t5 set column stats t5_name
('numDVs'='523','numNulls'='0','maxsize'='67','avgSize'='19.12560081481934');
alter table t5 set column stats branch_id ('numDVs'='53','numNulls'='0');
alter table t5 set column stats branch_name
('numDVs'='55','numNulls'='0','maxsize'='61','avgSize'='16.05229949951172');
{code}
Put a data file to each table to make the stats valid
{code:bash}
echo '2024' > data.txt
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t1
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t2
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t3
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t5
{code}
REFRESH these tables after adding the data files.
The cardinality of AggregationNodes are overestimated in the following query:
{code:sql}
explain select
register_date,
t4.t5_id,
t5.t5_name,
t5.branch_name,
count(distinct t1_id),
count(distinct case when diff_day=0 then t1_id else null end ),
count(distinct case when diff_day<=3 then t1_id else null end ),
count(distinct case when diff_day<=7 then t1_id else null end ),
count(distinct case when diff_day<=14 then t1_id else null end ),
count(distinct case when diff_day<=30 then t1_id else null end ),
count(distinct case when diff_day<=60 then t1_id else null end ),
count(distinct case when pay_time is not null then t1_id else null end )
from (
select t1.t1_id,t1.register_date,t1.t5_id,t2.pay_time,t2.t3_id,t3.t3_name,
datediff(pay_time,register_date) diff_day
from (
select t1_id,pay_time,t3_id from t2
where state_code = 0 and pay_time>=trunc(NOW(),'Y')
and cast(pay_time as date) <> cast(refund_time as date)
)t2
join t3 on t2.t3_id=t3.t3_id
right join t1 on t1.t1_id=t2.t1_id
)t4
left join t5 on t4.t5_id=t5.t5_id
where register_date='20230515'
group by register_date,t4.t5_id,t5.t5_name,t5.branch_name;{code}
One of the AggregationNode:
{noformat}
17:AGGREGATE [FINALIZE]
| Class 0
| output: count:merge(t1_id)
| group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
| Class 1
| output: count:merge(CASE WHEN diff_day = 0 THEN t1_id ELSE NULL END)
| group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
| Class 2
| output: count:merge(CASE WHEN diff_day <= 3 THEN t1_id ELSE NULL END)
| group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
| Class 3
| output: count:merge(CASE WHEN diff_day <= 7 THEN t1_id ELSE NULL END)
| group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
| Class 4
| output: count:merge(CASE WHEN diff_day <= 14 THEN t1_id ELSE NULL END)
| group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
| Class 5
| output: count:merge(CASE WHEN diff_day <= 30 THEN t1_id ELSE NULL END)
| group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
| Class 6
| output: count:merge(CASE WHEN diff_day <= 60 THEN t1_id ELSE NULL END)
| group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
| Class 7
| output: count:merge(CASE WHEN pay_time IS NOT NULL THEN t1_id ELSE NULL
END)
| group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
| mem-estimate=73.07TB mem-reservation=272.00MB thread-reservation=0
| tuple-ids=7N,9N,11N,13N,15N,17N,19N,21N row-size=761B cardinality=830.98G
| in pipelines: 17(GETNEXT), 15(OPEN){noformat}
The cardinality is 830.98B which is the multiplied NDVs of the group by columns
and times the 8 aggregation classes: 9283 * 389 * 523 * 55 * 8. (Thank
[~boroknagyz] for digging into this!)
This doesn't respect the predicates on the group by columns. Predicate
register_date='20230515' should reduce the NDV of 'register_date' to 1.
CC [~rizaon]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)