Quanlong Huang created IMPALA-13086:
---------------------------------------

             Summary: Cardinality estimate of AggregationNode should consider 
predicates on group-by columns
                 Key: IMPALA-13086
                 URL: https://issues.apache.org/jira/browse/IMPALA-13086
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: Quanlong Huang


Consider the following tables:
{code:sql}
CREATE EXTERNAL TABLE t1(
  t1_id bigint,
  t5_id bigint,
  t5_name string,
  register_date string
) stored as textfile;

CREATE EXTERNAL TABLE t2(
  t1_id bigint,
  t3_id bigint,
  pay_time timestamp,
  refund_time timestamp,
  state_code int
) stored as textfile;

CREATE EXTERNAL TABLE t3(
  t3_id bigint,
  t3_name string,
  class_id int
) stored as textfile;

CREATE EXTERNAL TABLE t5( 
  id bigint,
  t5_id bigint,
  t5_name string,
  branch_id bigint,
  branch_name string
) stored as textfile;

alter table t1 set tblproperties('numRows'='6031170829');
alter table t1 set column stats t1_id ('numDVs'='8131016','numNulls'='0');
alter table t1 set column stats t5_id ('numDVs'='389','numNulls'='0');
alter table t1 set column stats t5_name 
('numDVs'='523','numNulls'='85928157','maxsize'='27','avgSize'='17.79120063781738');
alter table t1 set column stats register_date 
('numDVs'='9283','numNulls'='0','maxsize'='8','avgSize'='8');

alter table t2 set tblproperties('numRows'='864341085');
alter table t2 set column stats t1_id ('numDVs'='1007302','numNulls'='0');
alter table t2 set column stats t3_id ('numDVs'='5013','numNulls'='2800503');
alter table t2 set column stats pay_time ('numDVs'='1372020','numNulls'='0');
alter table t2 set column stats refund_time 
('numDVs'='251658','numNulls'='791645118');
alter table t2 set column stats state_code ('numDVs'='8','numNulls'='0');

alter table t3 set tblproperties('numRows'='4452');
alter table t3 set column stats t3_id ('numDVs'='4452','numNulls'='0');
alter table t3 set column stats t3_name 
('numDVs'='4452','numNulls'='0','maxsize'='176','avgSize'='37.60469818115234');
alter table t3 set column stats class_id ('numDVs'='75','numNulls'='0');

alter table t5 set tblproperties('numRows'='2177245');
alter table t5 set column stats t5_id ('numDVs'='826','numNulls'='0');
alter table t5 set column stats t5_name 
('numDVs'='523','numNulls'='0','maxsize'='67','avgSize'='19.12560081481934');
alter table t5 set column stats branch_id ('numDVs'='53','numNulls'='0');
alter table t5 set column stats branch_name 
('numDVs'='55','numNulls'='0','maxsize'='61','avgSize'='16.05229949951172');
{code}
Put a data file to each table to make the stats valid
{code:bash}
echo '2024' > data.txt
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t1
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t2
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t3
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t5
{code}
REFRESH these tables after adding the data files.

The cardinality of AggregationNodes are overestimated in the following query:
{code:sql}
explain select 
  register_date,
  t4.t5_id, 
  t5.t5_name,
  t5.branch_name,
  count(distinct t1_id),
  count(distinct case when diff_day=0 then t1_id else null end ),
  count(distinct case when diff_day<=3 then t1_id else null end ),
  count(distinct case when diff_day<=7 then t1_id else null end ),
  count(distinct case when diff_day<=14 then t1_id else null end ),
  count(distinct case when diff_day<=30 then t1_id else null end ),
  count(distinct case when diff_day<=60 then t1_id else null end ),
  count(distinct case when pay_time is not null then t1_id else null end )
from (
  select t1.t1_id,t1.register_date,t1.t5_id,t2.pay_time,t2.t3_id,t3.t3_name,
    datediff(pay_time,register_date) diff_day
  from (
    select t1_id,pay_time,t3_id from t2
    where state_code = 0 and pay_time>=trunc(NOW(),'Y')
      and cast(pay_time as date) <> cast(refund_time as date)
  )t2
  join t3 on t2.t3_id=t3.t3_id
  right join t1 on t1.t1_id=t2.t1_id
)t4
left join t5 on t4.t5_id=t5.t5_id
where register_date='20230515'
group by register_date,t4.t5_id,t5.t5_name,t5.branch_name;{code}
One of the AggregationNode:
{noformat}
17:AGGREGATE [FINALIZE]
|  Class 0
|    output: count:merge(t1_id)
|    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 1
|    output: count:merge(CASE WHEN diff_day = 0 THEN t1_id ELSE NULL END)
|    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 2
|    output: count:merge(CASE WHEN diff_day <= 3 THEN t1_id ELSE NULL END)
|    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 3
|    output: count:merge(CASE WHEN diff_day <= 7 THEN t1_id ELSE NULL END)
|    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 4
|    output: count:merge(CASE WHEN diff_day <= 14 THEN t1_id ELSE NULL END)
|    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 5
|    output: count:merge(CASE WHEN diff_day <= 30 THEN t1_id ELSE NULL END)
|    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 6
|    output: count:merge(CASE WHEN diff_day <= 60 THEN t1_id ELSE NULL END)
|    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 7
|    output: count:merge(CASE WHEN pay_time IS NOT NULL THEN t1_id ELSE NULL 
END)
|    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  mem-estimate=73.07TB mem-reservation=272.00MB thread-reservation=0
|  tuple-ids=7N,9N,11N,13N,15N,17N,19N,21N row-size=761B cardinality=830.98G
|  in pipelines: 17(GETNEXT), 15(OPEN){noformat}
The cardinality is 830.98B which is the multiplied NDVs of the group by columns 
and times the 8 aggregation classes: 9283 * 389 * 523 * 55 * 8. (Thank 
[~boroknagyz] for digging into this!)

This doesn't respect the predicates on the group by columns. Predicate 
register_date='20230515' should reduce the NDV of 'register_date' to 1.

CC [~rizaon]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to