[
https://issues.apache.org/jira/browse/IMPALA-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905640#comment-17905640
]
ASF subversion and git services commented on IMPALA-13405:
----------------------------------------------------------
Commit 2828e473710cc246dbdeb9e1da772c45881cfddb in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2828e4737 ]
IMPALA-13480: VALIDATE_CARDINALITY in some aggregation tests
This patch enable VALIDATE_CARDINALITY test options in several planner
tests that touch aggregation node. Enabling it has revealed three bugs.
First, in IMPALA-13405, cardinality estimate of MERGE phase aggregation
is not capped against the output cardinality of the EXCHANGE node. This
patch fix it by adding such capping.
Second, tuple-based optimization IMPALA-13405 can cause cardinality
underestimation if HAVING predicate exist. This is due to the default
selectivity of 10% applied for each HAVING predicate. This patch skip
tuple-based optimization if AggregationNode.conjuncts_ is ever not
empty. It will stay skipped on stats recompute, even if conjuncts_ is
transfered into the next Merge AggregationNode above the plan. The
optimization skip causes following PlannerTest (under
testdata/workloads/functional-planner/queries/PlannerTest/) to revert
their cardinality estimation to their state pior to IMPALA-13405:
- tpcds/tpcds-q39a.test
- tpcds/tpcds-q39b.test
- tpcds_cpu_cost/tpcds-q39a.test
- tpcds_cpu_cost/tpcds-q39b.test
In the future, we should consider raising the default selectivity for
HAVING predicate and undo this skipping logic (IMPALA-13542).
Third, is missing stats recompute after conjunct transfer in multi-phase
aggregation. This will be fixed separately by IMPALA-13526.
Testing:
- Enable cardinality validation in testMultipleDistinct*
- Update aggregation.test to reflect current PlannerTest output.
Added some test cases in aggregation.test.
- Run and pass TpcdsPlannerTest and TpcdsCpuPlannerTest.
- Selectively run some more planner tests that touch AggregationNode and
pass them.
Change-Id: Iadb4af9fd65fdb85b66fae1e403ccec8ca5eb102
Reviewed-on: http://gerrit.cloudera.org:8080/22184
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Lower AggregationNode cardinality by analyzing estimate of source Tuple
> -----------------------------------------------------------------------
>
> Key: IMPALA-13405
> URL: https://issues.apache.org/jira/browse/IMPALA-13405
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 4.4.0
> Reporter: Riza Suminto
> Assignee: Riza Suminto
> Priority: Major
> Fix For: Impala 4.5.0
>
>
> If an aggregation node has multiple grouping expressions that originate from
> the same tuple, then their combined NDV must not exceed output cardinality of
> PlanNode producing that tuple. Take example of this PARALLELPLANS from
> [Q31|https://github.com/apache/impala/blob/101e10b/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test].
>
> {code:java}
> | 11:AGGREGATE [STREAMING]
> | | output: sum(ss_ext_sales_price)
> | | group by: ca_county, d_qoy, d_year
> | | mem-estimate=84.55MB mem-reservation=34.00MB spill-buffer=2.00MB
> thread-reservation=0
> | | tuple-ids=8 row-size=50B cardinality=1.43M cost=1948896250
> | | in pipelines: 06(GETNEXT)
> ....
> | | 07:SCAN HDFS [tpcds_partitioned_parquet_snap.date_dim, RANDOM]
> | | HDFS partitions=1/1 files=1 size=2.17MB
> | | predicates: tpcds_partitioned_parquet_snap.date_dim.d_year =
> CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS
> INT)
> | | stored statistics:
> | | table: rows=73.05K size=2.17MB
> | | columns: all
> | | extrapolated-rows=disabled max-scan-range-rows=73.05K
> | | parquet statistics predicates:
> tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT),
> tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> | | parquet dictionary predicates:
> tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT),
> tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> | | mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=0
> | | tuple-ids=6 row-size=12B cardinality=186 cost=16728
> | | in pipelines: 07(GETNEXT) {code}
>
> Cardinality estimate of 11:AGGREGATE comes from this calculation:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * NDV(d_qoy) * NDV (d_year)
> = 1825 * 4 * 196
> = 1430800{code}
> However, d_qoy and d_year belong to the same TupleId 6 coming out from
> 07:SCAN, so its cardinality can be estimated lower to this:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * est_cardinality(07:SCAN)
> = 1825 * 186
> = 339450{code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]