Riza Suminto created IMPALA-13405:
-------------------------------------
Summary: Lower AggregationNode cardinality by analyzing estimate
of source Tuple
Key: IMPALA-13405
URL: https://issues.apache.org/jira/browse/IMPALA-13405
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Affects Versions: Impala 4.4.0
Reporter: Riza Suminto
Assignee: Riza Suminto
If an aggregation node has multiple grouping expressions that originate from
the same tuple, then their combined NDV must not exceed output cardinality of
PlanNode producing that tuple. Take example of this PARALLELPLANS from
[Q31|https://github.com/apache/impala/blob/101e10b/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test].
{code:java}
| 11:AGGREGATE [STREAMING]
| | output: sum(ss_ext_sales_price)
| | group by: ca_county, d_qoy, d_year
| | mem-estimate=84.55MB mem-reservation=34.00MB spill-buffer=2.00MB
thread-reservation=0
| | tuple-ids=8 row-size=50B cardinality=1.43M cost=1948896250
| | in pipelines: 06(GETNEXT)
....
| | 07:SCAN HDFS [tpcds_partitioned_parquet_snap.date_dim, RANDOM]
| | HDFS partitions=1/1 files=1 size=2.17MB
| | predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998
AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
| | stored statistics:
| | table: rows=73.05K size=2.17MB
| | columns: all
| | extrapolated-rows=disabled max-scan-range-rows=73.05K
| | parquet statistics predicates:
tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT),
tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
| | parquet dictionary predicates:
tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT),
tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
| | mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=0
| | tuple-ids=6 row-size=12B cardinality=186 cost=16728
| | in pipelines: 07(GETNEXT) {code}
Cardinality estimate of 11:AGGREGATE comes from this calculation:
{code:java}
est_cardinality(11:AGG) = NDV(ca_county) * NDV(d_qoy) * NDV (d_year)
= 1825 * 4 * 196
= 1430800{code}
However, d_qoy and d_year belong to the same TupleId 6 coming out from 07:SCAN,
so its cardinality can be estimated lower to this:
{code:java}
est_cardinality(11:AGG) = NDV(ca_county) * est_cardinality(07:SCAN)
= 1825 * 186
= 339450{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)