[
https://issues.apache.org/jira/browse/IMPALA-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Riza Suminto resolved IMPALA-13405.
-----------------------------------
Fix Version/s: Impala 4.5.0
Resolution: Fixed
> Lower AggregationNode cardinality by analyzing estimate of source Tuple
> -----------------------------------------------------------------------
>
> Key: IMPALA-13405
> URL: https://issues.apache.org/jira/browse/IMPALA-13405
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 4.4.0
> Reporter: Riza Suminto
> Assignee: Riza Suminto
> Priority: Major
> Fix For: Impala 4.5.0
>
>
> If an aggregation node has multiple grouping expressions that originate from
> the same tuple, then their combined NDV must not exceed output cardinality of
> PlanNode producing that tuple. Take example of this PARALLELPLANS from
> [Q31|https://github.com/apache/impala/blob/101e10b/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test].
>
> {code:java}
> | 11:AGGREGATE [STREAMING]
> | | output: sum(ss_ext_sales_price)
> | | group by: ca_county, d_qoy, d_year
> | | mem-estimate=84.55MB mem-reservation=34.00MB spill-buffer=2.00MB
> thread-reservation=0
> | | tuple-ids=8 row-size=50B cardinality=1.43M cost=1948896250
> | | in pipelines: 06(GETNEXT)
> ....
> | | 07:SCAN HDFS [tpcds_partitioned_parquet_snap.date_dim, RANDOM]
> | | HDFS partitions=1/1 files=1 size=2.17MB
> | | predicates: tpcds_partitioned_parquet_snap.date_dim.d_year =
> CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS
> INT)
> | | stored statistics:
> | | table: rows=73.05K size=2.17MB
> | | columns: all
> | | extrapolated-rows=disabled max-scan-range-rows=73.05K
> | | parquet statistics predicates:
> tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT),
> tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> | | parquet dictionary predicates:
> tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT),
> tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> | | mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=0
> | | tuple-ids=6 row-size=12B cardinality=186 cost=16728
> | | in pipelines: 07(GETNEXT) {code}
>
> Cardinality estimate of 11:AGGREGATE comes from this calculation:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * NDV(d_qoy) * NDV (d_year)
> = 1825 * 4 * 196
> = 1430800{code}
> However, d_qoy and d_year belong to the same TupleId 6 coming out from
> 07:SCAN, so its cardinality can be estimated lower to this:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * est_cardinality(07:SCAN)
> = 1825 * 186
> = 339450{code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)