Riza Suminto created IMPALA-13405:
-------------------------------------

             Summary: Lower AggregationNode cardinality by analyzing estimate of source Tuple
                 Key: IMPALA-13405
                 URL: https://issues.apache.org/jira/browse/IMPALA-13405
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 4.4.0
            Reporter: Riza Suminto
            Assignee: Riza Suminto


If an aggregation node has multiple grouping expressions that originate from 
the same tuple, then their combined NDV (number of distinct values) cannot 
exceed the output cardinality of the PlanNode producing that tuple. As an 
example, take this PARALLELPLANS excerpt from 
[Q31|https://github.com/apache/impala/blob/101e10b/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test].

 
{code:java}
|  11:AGGREGATE [STREAMING]
|  |  output: sum(ss_ext_sales_price)
|  |  group by: ca_county, d_qoy, d_year
|  |  mem-estimate=84.55MB mem-reservation=34.00MB spill-buffer=2.00MB thread-reservation=0
|  |  tuple-ids=8 row-size=50B cardinality=1.43M cost=1948896250
|  |  in pipelines: 06(GETNEXT)

....

|  |  07:SCAN HDFS [tpcds_partitioned_parquet_snap.date_dim, RANDOM]
|  |     HDFS partitions=1/1 files=1 size=2.17MB
|  |     predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
|  |     stored statistics:
|  |       table: rows=73.05K size=2.17MB
|  |       columns: all
|  |     extrapolated-rows=disabled max-scan-range-rows=73.05K
|  |     parquet statistics predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
|  |     parquet dictionary predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
|  |     mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=0
|  |     tuple-ids=6 row-size=12B cardinality=186 cost=16728
|  |     in pipelines: 07(GETNEXT) {code}
 

The cardinality estimate of 11:AGGREGATE currently comes from this calculation:
{code:java}
est_cardinality(11:AGG) = NDV(ca_county) * NDV(d_qoy) * NDV(d_year)
                        = 1825 * 4 * 196
                        = 1430800{code}
However, d_qoy and d_year both belong to tuple-id 6, which is produced by 
07:SCAN with an estimated cardinality of 186. Their combined NDV 
(4 * 196 = 784) cannot exceed that, so the estimate for 11:AGGREGATE can be 
lowered to:
{code:java}
est_cardinality(11:AGG) = NDV(ca_county) * est_cardinality(07:SCAN)
                        = 1825 * 186
                        = 339450{code}
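
The two calculations above can be reproduced with a small sketch. This is only an illustration of the proposed rule, not Impala's planner code: the function names, the dictionary layout, and the "customer_address" tuple label for ca_county are hypothetical, and no producer cardinality is assumed for that tuple because the plan excerpt does not show one.

```python
from math import prod


def naive_agg_cardinality(ndv_by_expr):
    """Current estimate: plain product of the NDV of every grouping expression."""
    return prod(ndv_by_expr.values())


def capped_agg_cardinality(ndv_by_expr, tuple_of_expr, producer_cardinality):
    """Proposed estimate: cap the NDV product of each source tuple.

    Grouping expressions are bucketed by the tuple they originate from. The
    NDV product of each bucket is capped by the output cardinality of the
    PlanNode producing that tuple, when that cardinality is known.
    """
    ndv_product_by_tuple = {}
    for expr, ndv in ndv_by_expr.items():
        tuple_id = tuple_of_expr[expr]
        ndv_product_by_tuple[tuple_id] = ndv_product_by_tuple.get(tuple_id, 1) * ndv

    estimate = 1
    for tuple_id, ndv_product in ndv_product_by_tuple.items():
        cap = producer_cardinality.get(tuple_id)  # None means "unknown, do not cap"
        estimate *= ndv_product if cap is None else min(ndv_product, cap)
    return estimate


# NDV values and the 07:SCAN cardinality are taken from the example above.
ndv_by_expr = {"ca_county": 1825, "d_qoy": 4, "d_year": 196}
# Tuple 6 is from the plan; "customer_address" is a hypothetical label.
tuple_of_expr = {"ca_county": "customer_address", "d_qoy": 6, "d_year": 6}
producer_cardinality = {6: 186}

naive = naive_agg_cardinality(ndv_by_expr)
capped = capped_agg_cardinality(ndv_by_expr, tuple_of_expr, producer_cardinality)
print(naive)   # 1430800
print(capped)  # 339450
```

Taking the minimum per tuple means the cap only ever lowers an estimate: when the NDV product of a tuple is already below the producer's cardinality, the result is the same as the plain product.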
 

 


