[ 
https://issues.apache.org/jira/browse/IMPALA-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riza Suminto resolved IMPALA-13405.
-----------------------------------
    Fix Version/s: Impala 4.5.0
       Resolution: Fixed

> Lower AggregationNode cardinality by analyzing estimate of source Tuple
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-13405
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13405
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.4.0
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>             Fix For: Impala 4.5.0
>
>
> If an aggregation node has multiple grouping expressions that originate from 
> the same tuple, then their combined NDV (number of distinct values) must not 
> exceed the output cardinality of the PlanNode producing that tuple. Take, for 
> example, this PARALLELPLANS excerpt from 
> [Q31|https://github.com/apache/impala/blob/101e10b/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test]:
>  
> {code:java}
> |  11:AGGREGATE [STREAMING]
> |  |  output: sum(ss_ext_sales_price)
> |  |  group by: ca_county, d_qoy, d_year
> |  |  mem-estimate=84.55MB mem-reservation=34.00MB spill-buffer=2.00MB thread-reservation=0
> |  |  tuple-ids=8 row-size=50B cardinality=1.43M cost=1948896250
> |  |  in pipelines: 06(GETNEXT)
> ....
> |  |  07:SCAN HDFS [tpcds_partitioned_parquet_snap.date_dim, RANDOM]
> |  |     HDFS partitions=1/1 files=1 size=2.17MB
> |  |     predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> |  |     stored statistics:
> |  |       table: rows=73.05K size=2.17MB
> |  |       columns: all
> |  |     extrapolated-rows=disabled max-scan-range-rows=73.05K
> |  |     parquet statistics predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> |  |     parquet dictionary predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> |  |     mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=0
> |  |     tuple-ids=6 row-size=12B cardinality=186 cost=16728
> |  |     in pipelines: 07(GETNEXT) {code}
>  
> The cardinality estimate of 11:AGGREGATE comes from this calculation:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * NDV(d_qoy) * NDV(d_year)
>                         = 1825 * 4 * 196
>                         = 1430800{code}
> However, d_qoy and d_year both belong to TupleId 6, which is produced by 
> 07:SCAN, so their combined NDV (4 * 196 = 784) cannot exceed that scan's 
> output cardinality of 186. The estimate can therefore be lowered to this:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * est_cardinality(07:SCAN)
>                         = 1825 * 186
>                         = 339450{code}
>  
>  
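
The capping rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual Impala frontend change: the function name, the input shapes, and the tuple id "ca" assigned to ca_county are hypothetical (the plan excerpt does not show which tuple ca_county comes from).

```python
from collections import defaultdict
from math import prod


def estimate_agg_cardinality(grouping_exprs, producer_cardinality):
    """Sketch of a tuple-aware aggregation cardinality estimate.

    grouping_exprs: list of (column_name, ndv, tuple_id).
    producer_cardinality: dict mapping tuple_id to the output cardinality
        of the plan node that produces that tuple (hypothetical input shape).
    """
    # Multiply the NDVs of grouping expressions that share a source tuple.
    ndv_by_tuple = defaultdict(lambda: 1)
    for _name, ndv, tuple_id in grouping_exprs:
        ndv_by_tuple[tuple_id] *= ndv

    # Cap each per-tuple product at the producer's cardinality, when known.
    capped = [
        min(ndv, producer_cardinality.get(tuple_id, ndv))
        for tuple_id, ndv in ndv_by_tuple.items()
    ]
    return prod(capped)


# Numbers taken from the Q31 plan above; "ca" is a made-up tuple id.
exprs = [("ca_county", 1825, "ca"), ("d_qoy", 4, 6), ("d_year", 196, 6)]
naive = prod(ndv for _name, ndv, _tid in exprs)
capped = estimate_agg_cardinality(exprs, {6: 186})
print(naive, capped)
```

With the Q31 numbers, the naive product is 1430800 and the capped estimate is 339450, matching the two calculations in the description.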



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
