[jira] [Commented] (IMPALA-13086) Cardinality estimate of AggregationNode should consider predicates on group-by columns

ASF subversion and git services (Jira) Sat, 18 Jan 2025 11:45:19 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-13086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914374#comment-17914374
 ]


ASF subversion and git services commented on IMPALA-13086:
----------------------------------------------------------

Commit c298c542621cb58ffe0772bf29ebdf7316cb77d1 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c298c5426 ]

IMPALA-13644: Generalize and move getPerInstanceNdvForCpuCosting

getPerInstanceNdvForCpuCosting is a method to estimate the number of
distinct values of exprs per fragment instance when accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is exclusively
used by AggregationNode only.

getPerInstanceNdvForCpuCosting run the probabilistic formula
individually for each grouping expression and then multiply it together.
That match with how we estimate group NDV in the past where we simply do
NDV multiplication of each grouping expression.

Recently, we adds tuple-based analysis to lower cardinality estimate for
all kind of aggregation node (IMPALA-13045, IMPALA-13465, IMPALA-13086).
All of the bounding happens in AggregationNode.computeStats(), where we
call estimateNumGroups() function that returns globalNdv estimate for
specific aggregation class.

To take advantage from that more precise globalNdv, this patch replace
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality() that
apply the probabilistic formula over this single globalNdv number rather
than the old way where it often return an overestimated number from NDV
multiplication method. Its use is still limited only to calculate
ProcessingCost. Using it for preagg output cardinality will be done by
IMPALA-2945.

estimatePreaggCardinality is skipped if data partition of input is a
subset of grouping expression.

Testing:
- Run and pass PlannerTest that set COMPUTE_PROCESSING_COST=True.
  ProcessingCost changes, but all cardinality number stays.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
  runtime profile debugging.

Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Cardinality estimate of AggregationNode should consider predicates on 
> group-by columns
> --------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13086
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13086
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Quanlong Huang
>            Assignee: Riza Suminto
>            Priority: Critical
>             Fix For: Impala 4.5.0
>
>         Attachments: plan.txt
>
>
> Consider the following tables:
> {code:sql}
> CREATE EXTERNAL TABLE t1(
>   t1_id bigint,
>   t5_id bigint,
>   t5_name string,
>   register_date string
> ) stored as textfile;
> CREATE EXTERNAL TABLE t2(
>   t1_id bigint,
>   t3_id bigint,
>   pay_time timestamp,
>   refund_time timestamp,
>   state_code int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t3(
>   t3_id bigint,
>   t3_name string,
>   class_id int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t5( 
>   id bigint,
>   t5_id bigint,
>   t5_name string,
>   branch_id bigint,
>   branch_name string
> ) stored as textfile;
> alter table t1 set tblproperties('numRows'='6031170829');
> alter table t1 set column stats t1_id ('numDVs'='8131016','numNulls'='0');
> alter table t1 set column stats t5_id ('numDVs'='389','numNulls'='0');
> alter table t1 set column stats t5_name 
> ('numDVs'='523','numNulls'='85928157','maxsize'='27','avgSize'='17.79120063781738');
> alter table t1 set column stats register_date 
> ('numDVs'='9283','numNulls'='0','maxsize'='8','avgSize'='8');
> alter table t2 set tblproperties('numRows'='864341085');
> alter table t2 set column stats t1_id ('numDVs'='1007302','numNulls'='0');
> alter table t2 set column stats t3_id ('numDVs'='5013','numNulls'='2800503');
> alter table t2 set column stats pay_time ('numDVs'='1372020','numNulls'='0');
> alter table t2 set column stats refund_time 
> ('numDVs'='251658','numNulls'='791645118');
> alter table t2 set column stats state_code ('numDVs'='8','numNulls'='0');
> alter table t3 set tblproperties('numRows'='4452');
> alter table t3 set column stats t3_id ('numDVs'='4452','numNulls'='0');
> alter table t3 set column stats t3_name 
> ('numDVs'='4452','numNulls'='0','maxsize'='176','avgSize'='37.60469818115234');
> alter table t3 set column stats class_id ('numDVs'='75','numNulls'='0');
> alter table t5 set tblproperties('numRows'='2177245');
> alter table t5 set column stats t5_id ('numDVs'='826','numNulls'='0');
> alter table t5 set column stats t5_name 
> ('numDVs'='523','numNulls'='0','maxsize'='67','avgSize'='19.12560081481934');
> alter table t5 set column stats branch_id ('numDVs'='53','numNulls'='0');
> alter table t5 set column stats branch_name 
> ('numDVs'='55','numNulls'='0','maxsize'='61','avgSize'='16.05229949951172');
> {code}
> Put a data file to each table to make the stats valid
> {code:bash}
> echo '2024' > data.txt
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t1
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t2
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t3
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t5
> {code}
> REFRESH these tables after adding the data files.
> The cardinality of AggregationNodes are overestimated in the following query:
> {code:sql}
> explain select 
>   register_date,
>   t4.t5_id, 
>   t5.t5_name,
>   t5.branch_name,
>   count(distinct t1_id),
>   count(distinct case when diff_day=0 then t1_id else null end ),
>   count(distinct case when diff_day<=3 then t1_id else null end ),
>   count(distinct case when diff_day<=7 then t1_id else null end ),
>   count(distinct case when diff_day<=14 then t1_id else null end ),
>   count(distinct case when diff_day<=30 then t1_id else null end ),
>   count(distinct case when diff_day<=60 then t1_id else null end ),
>   count(distinct case when pay_time is not null then t1_id else null end )
> from (
>   select t1.t1_id,t1.register_date,t1.t5_id,t2.pay_time,t2.t3_id,t3.t3_name,
>     datediff(pay_time,register_date) diff_day
>   from (
>     select t1_id,pay_time,t3_id from t2
>     where state_code = 0 and pay_time>=trunc(NOW(),'Y')
>       and cast(pay_time as date) <> cast(refund_time as date)
>   )t2
>   join t3 on t2.t3_id=t3.t3_id
>   right join t1 on t1.t1_id=t2.t1_id
> )t4
> left join t5 on t4.t5_id=t5.t5_id
> where register_date='20230515'
> group by register_date,t4.t5_id,t5.t5_name,t5.branch_name;{code}
> One of the AggregationNode:
> {noformat}
> 17:AGGREGATE [FINALIZE]
> |  Class 0
> |    output: count:merge(t1_id)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 1
> |    output: count:merge(CASE WHEN diff_day = 0 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 2
> |    output: count:merge(CASE WHEN diff_day <= 3 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 3
> |    output: count:merge(CASE WHEN diff_day <= 7 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 4
> |    output: count:merge(CASE WHEN diff_day <= 14 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 5
> |    output: count:merge(CASE WHEN diff_day <= 30 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 6
> |    output: count:merge(CASE WHEN diff_day <= 60 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 7
> |    output: count:merge(CASE WHEN pay_time IS NOT NULL THEN t1_id ELSE NULL 
> END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  mem-estimate=73.07TB mem-reservation=272.00MB thread-reservation=0
> |  tuple-ids=7N,9N,11N,13N,15N,17N,19N,21N row-size=761B cardinality=830.98G
> |  in pipelines: 17(GETNEXT), 15(OPEN){noformat}
> The cardinality is 830.98B which is the multiplied NDVs of the group by 
> columns and times the 8 aggregation classes: 9283 * 389 * 523 * 55 * 8. 
> (Thank [~boroknagyz] for digging into this!)
> This doesn't respect the predicates on the group by columns. Predicate 
> register_date='20230515' should reduce the NDV of 'register_date' to 1.
> CC [~rizaon]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-13086) Cardinality estimate of AggregationNode should consider predicates on group-by columns

Reply via email to