[jira] [Commented] (IMPALA-13405) Lower AggregationNode cardinality by analyzing estimate of source Tuple

ASF subversion and git services (Jira) Fri, 13 Dec 2024 16:00:45 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905640#comment-17905640
 ]


ASF subversion and git services commented on IMPALA-13405:
----------------------------------------------------------

Commit 2828e473710cc246dbdeb9e1da772c45881cfddb in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2828e4737 ]

IMPALA-13480: VALIDATE_CARDINALITY in some aggregation tests

This patch enable VALIDATE_CARDINALITY test options in several planner
tests that touch aggregation node. Enabling it has revealed three bugs.

First, in IMPALA-13405, cardinality estimate of MERGE phase aggregation
is not capped against the output cardinality of the EXCHANGE node. This
patch fix it by adding such capping.

Second, tuple-based optimization IMPALA-13405 can cause cardinality
underestimation if HAVING predicate exist. This is due to the default
selectivity of 10% applied for each HAVING predicate. This patch skip
tuple-based optimization if AggregationNode.conjuncts_ is ever not
empty. It will stay skipped on stats recompute, even if conjuncts_ is
transfered into the next Merge AggregationNode above the plan. The
optimization skip causes following PlannerTest (under
testdata/workloads/functional-planner/queries/PlannerTest/) to revert
their cardinality estimation to their state pior to IMPALA-13405:
- tpcds/tpcds-q39a.test
- tpcds/tpcds-q39b.test
- tpcds_cpu_cost/tpcds-q39a.test
- tpcds_cpu_cost/tpcds-q39b.test
In the future, we should consider raising the default selectivity for
HAVING predicate and undo this skipping logic (IMPALA-13542).

Third, is missing stats recompute after conjunct transfer in multi-phase
aggregation. This will be fixed separately by IMPALA-13526.

Testing:
- Enable cardinality validation in testMultipleDistinct*
- Update aggregation.test to reflect current PlannerTest output.
  Added some test cases in aggregation.test.
- Run and pass TpcdsPlannerTest and TpcdsCpuPlannerTest.
- Selectively run some more planner tests that touch AggregationNode and
  pass them.

Change-Id: Iadb4af9fd65fdb85b66fae1e403ccec8ca5eb102
Reviewed-on: http://gerrit.cloudera.org:8080/22184
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Lower AggregationNode cardinality by analyzing estimate of source Tuple
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-13405
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13405
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.4.0
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>             Fix For: Impala 4.5.0
>
>
> If an aggregation node has multiple grouping expressions that originate from 
> the same tuple, then their combined NDV must not exceed output cardinality of 
> PlanNode producing that tuple. Take example of this PARALLELPLANS from 
> [Q31|https://github.com/apache/impala/blob/101e10b/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test].
>  
> {code:java}
> |  11:AGGREGATE [STREAMING]
> |  |  output: sum(ss_ext_sales_price)
> |  |  group by: ca_county, d_qoy, d_year
> |  |  mem-estimate=84.55MB mem-reservation=34.00MB spill-buffer=2.00MB 
> thread-reservation=0
> |  |  tuple-ids=8 row-size=50B cardinality=1.43M cost=1948896250
> |  |  in pipelines: 06(GETNEXT)
> ....
> |  |  07:SCAN HDFS [tpcds_partitioned_parquet_snap.date_dim, RANDOM]
> |  |     HDFS partitions=1/1 files=1 size=2.17MB
> |  |     predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = 
> CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS 
> INT)
> |  |     stored statistics:
> |  |       table: rows=73.05K size=2.17MB
> |  |       columns: all
> |  |     extrapolated-rows=disabled max-scan-range-rows=73.05K
> |  |     parquet statistics predicates: 
> tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), 
> tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> |  |     parquet dictionary predicates: 
> tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), 
> tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> |  |     mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=0
> |  |     tuple-ids=6 row-size=12B cardinality=186 cost=16728
> |  |     in pipelines: 07(GETNEXT) {code}
>  
> Cardinality estimate of 11:AGGREGATE comes from this calculation:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * NDV(d_qoy) * NDV (d_year)
>                         = 1825 * 4 * 196
>                         = 1430800{code}
> However, d_qoy and d_year belong to the same TupleId 6 coming out from 
> 07:SCAN, so its cardinality can be estimated lower to this:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * est_cardinality(07:SCAN)
>                         = 1825 * 186
>                         = 339450{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-13405) Lower AggregationNode cardinality by analyzing estimate of source Tuple

Reply via email to