[jira] [Commented] (IMPALA-13086) Cardinality estimate of AggregationNode should consider predicates on group-by columns

ASF subversion and git services (Jira) Tue, 07 Jan 2025 18:49:26 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-13086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17910898#comment-17910898
 ]


ASF subversion and git services commented on IMPALA-13086:
----------------------------------------------------------

Commit 617e99981e10e39fa769aaf9ffbbd6937cfa4a29 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=617e99981 ]

IMPALA-13086: Lower AggregationNode estimate using stats predicate

NDV of a grouping column can be reduced if there is a predicate over
that column. If the predicate is a constant equality predicate or
is-null predicate, then the NDV must be equal to 1. If the predicate is
a simple in-list predicate, the NDV must be the number of items in the
list.

This patch adds such consideration by leveraging existing analysis in
HdfsScanNode.computeStatsTupleAndConjuncts(). It memorizes the first
ScanNode/UnionNode that produces a TupleId in Analyzer, registered
during Init()/computeStats() of the PlanNode. At AggregationNode, it
looks up the PlanNode that produces a TupleId. If the origin PlanNode is
an HdfsScanNode, analyze if any grouping expression is listed in
statsOriginalConjuncts_ and reduce them accordingly. If
HdfsScanNode.computeStatsTupleAndConjuncts() can be made generic for all
ScanNode implementations in the future, we can apply this same analysis
to all kinds of ScanNode and achieve the same reduction.

In terms of tracking producer PlanNode, this patch made an exception for
Iceberg PlanNodes that handle positional or equality deletion. In that
scenario, it is possible to have two ScanNodes sharing the same TupleId
to force UnionNode passthrough. Therefore, the UnionNode will be
acknowledged as the first producer of that TupleId.

This patch also remove some redundant operation in HdfsScanNode.
Fixed typo in method name MathUtil.saturatingMultiplyCardinalities().

Testing:
- Add new test cases in aggregation.test
- Pass core tests.

Change-Id: Ia840d68f1c4f126d4e928461ec5c44545dbf25f8
Reviewed-on: http://gerrit.cloudera.org:8080/22032
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Cardinality estimate of AggregationNode should consider predicates on 
> group-by columns
> --------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13086
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13086
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Quanlong Huang
>            Assignee: Riza Suminto
>            Priority: Critical
>             Fix For: Impala 4.5.0
>
>         Attachments: plan.txt
>
>
> Consider the following tables:
> {code:sql}
> CREATE EXTERNAL TABLE t1(
>   t1_id bigint,
>   t5_id bigint,
>   t5_name string,
>   register_date string
> ) stored as textfile;
> CREATE EXTERNAL TABLE t2(
>   t1_id bigint,
>   t3_id bigint,
>   pay_time timestamp,
>   refund_time timestamp,
>   state_code int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t3(
>   t3_id bigint,
>   t3_name string,
>   class_id int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t5( 
>   id bigint,
>   t5_id bigint,
>   t5_name string,
>   branch_id bigint,
>   branch_name string
> ) stored as textfile;
> alter table t1 set tblproperties('numRows'='6031170829');
> alter table t1 set column stats t1_id ('numDVs'='8131016','numNulls'='0');
> alter table t1 set column stats t5_id ('numDVs'='389','numNulls'='0');
> alter table t1 set column stats t5_name 
> ('numDVs'='523','numNulls'='85928157','maxsize'='27','avgSize'='17.79120063781738');
> alter table t1 set column stats register_date 
> ('numDVs'='9283','numNulls'='0','maxsize'='8','avgSize'='8');
> alter table t2 set tblproperties('numRows'='864341085');
> alter table t2 set column stats t1_id ('numDVs'='1007302','numNulls'='0');
> alter table t2 set column stats t3_id ('numDVs'='5013','numNulls'='2800503');
> alter table t2 set column stats pay_time ('numDVs'='1372020','numNulls'='0');
> alter table t2 set column stats refund_time 
> ('numDVs'='251658','numNulls'='791645118');
> alter table t2 set column stats state_code ('numDVs'='8','numNulls'='0');
> alter table t3 set tblproperties('numRows'='4452');
> alter table t3 set column stats t3_id ('numDVs'='4452','numNulls'='0');
> alter table t3 set column stats t3_name 
> ('numDVs'='4452','numNulls'='0','maxsize'='176','avgSize'='37.60469818115234');
> alter table t3 set column stats class_id ('numDVs'='75','numNulls'='0');
> alter table t5 set tblproperties('numRows'='2177245');
> alter table t5 set column stats t5_id ('numDVs'='826','numNulls'='0');
> alter table t5 set column stats t5_name 
> ('numDVs'='523','numNulls'='0','maxsize'='67','avgSize'='19.12560081481934');
> alter table t5 set column stats branch_id ('numDVs'='53','numNulls'='0');
> alter table t5 set column stats branch_name 
> ('numDVs'='55','numNulls'='0','maxsize'='61','avgSize'='16.05229949951172');
> {code}
> Put a data file to each table to make the stats valid
> {code:bash}
> echo '2024' > data.txt
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t1
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t2
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t3
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t5
> {code}
> REFRESH these tables after adding the data files.
> The cardinality of AggregationNodes are overestimated in the following query:
> {code:sql}
> explain select 
>   register_date,
>   t4.t5_id, 
>   t5.t5_name,
>   t5.branch_name,
>   count(distinct t1_id),
>   count(distinct case when diff_day=0 then t1_id else null end ),
>   count(distinct case when diff_day<=3 then t1_id else null end ),
>   count(distinct case when diff_day<=7 then t1_id else null end ),
>   count(distinct case when diff_day<=14 then t1_id else null end ),
>   count(distinct case when diff_day<=30 then t1_id else null end ),
>   count(distinct case when diff_day<=60 then t1_id else null end ),
>   count(distinct case when pay_time is not null then t1_id else null end )
> from (
>   select t1.t1_id,t1.register_date,t1.t5_id,t2.pay_time,t2.t3_id,t3.t3_name,
>     datediff(pay_time,register_date) diff_day
>   from (
>     select t1_id,pay_time,t3_id from t2
>     where state_code = 0 and pay_time>=trunc(NOW(),'Y')
>       and cast(pay_time as date) <> cast(refund_time as date)
>   )t2
>   join t3 on t2.t3_id=t3.t3_id
>   right join t1 on t1.t1_id=t2.t1_id
> )t4
> left join t5 on t4.t5_id=t5.t5_id
> where register_date='20230515'
> group by register_date,t4.t5_id,t5.t5_name,t5.branch_name;{code}
> One of the AggregationNode:
> {noformat}
> 17:AGGREGATE [FINALIZE]
> |  Class 0
> |    output: count:merge(t1_id)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 1
> |    output: count:merge(CASE WHEN diff_day = 0 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 2
> |    output: count:merge(CASE WHEN diff_day <= 3 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 3
> |    output: count:merge(CASE WHEN diff_day <= 7 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 4
> |    output: count:merge(CASE WHEN diff_day <= 14 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 5
> |    output: count:merge(CASE WHEN diff_day <= 30 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 6
> |    output: count:merge(CASE WHEN diff_day <= 60 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 7
> |    output: count:merge(CASE WHEN pay_time IS NOT NULL THEN t1_id ELSE NULL 
> END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  mem-estimate=73.07TB mem-reservation=272.00MB thread-reservation=0
> |  tuple-ids=7N,9N,11N,13N,15N,17N,19N,21N row-size=761B cardinality=830.98G
> |  in pipelines: 17(GETNEXT), 15(OPEN){noformat}
> The cardinality is 830.98B which is the multiplied NDVs of the group by 
> columns and times the 8 aggregation classes: 9283 * 389 * 523 * 55 * 8. 
> (Thank [~boroknagyz] for digging into this!)
> This doesn't respect the predicates on the group by columns. Predicate 
> register_date='20230515' should reduce the NDV of 'register_date' to 1.
> CC [~rizaon]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-13086) Cardinality estimate of AggregationNode should consider predicates on group-by columns

Reply via email to