[
https://issues.apache.org/jira/browse/IMPALA-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17909056#comment-17909056
]
ASF subversion and git services commented on IMPALA-13405:
----------------------------------------------------------
Commit b1628d8644d21c3b45acd8b2715a284cf5e8379b in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b1628d864 ]
IMPALA-13622: Fix negative cardinality bug in AggregationNode.java
An incomplete COMPUTE STATS during data loading revealed a bug in
AggregationNode.java where estimateNumGroups() can return a value less
than -1.
This patch fixes the bug by implementing
PlanNode.smallestValidCardinality() and
MathUtil.saturatingMultiplyCardinalities(). Both functions validate that
their arguments are valid cardinality numbers.
smallestValidCardinality() compares two cardinality numbers and returns
the smaller valid one. It generalizes and replaces the static function
PlanNode.capCardinalityAtLimit().
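A minimal sketch of the comparison described above, not the code from the
commit. It assumes that -1 is the sentinel for an unknown cardinality and
that anything below -1 is invalid; the class name, the UNKNOWN constant and
the checkValid() helper are hypothetical.

```java
final class SmallestValidCardinalitySketch {
  // Assumption: -1 marks an unknown cardinality; values below -1 are invalid.
  static final long UNKNOWN = -1;

  // Hypothetical helper: reject anything that is not a valid cardinality.
  static void checkValid(long cardinality) {
    if (cardinality < UNKNOWN) {
      throw new IllegalArgumentException("invalid cardinality: " + cardinality);
    }
  }

  // Returns the smaller of the two known cardinalities. If only one side is
  // known, that side wins; if neither is known, the result is UNKNOWN.
  static long smallestValidCardinality(long a, long b) {
    checkValid(a);
    checkValid(b);
    if (a == UNKNOWN) return b;
    if (b == UNKNOWN) return a;
    return Math.min(a, b);
  }

  public static void main(String[] args) {
    System.out.println(smallestValidCardinality(100, 40));  // both known
    System.out.println(smallestValidCardinality(-1, 40));   // left unknown
    System.out.println(smallestValidCardinality(-1, -1));   // both unknown
  }
}
```

A plain Math.min() would pick -1 whenever one side is unknown, which is the
kind of case a sentinel-aware comparison has to handle separately.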
saturatingMultiplyCardinalities() adds validation and normalization on
top of MathUtil.saturatingMultiply().
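One way such a wrapper could look, as a sketch under stated assumptions
rather than the committed implementation. It assumes -1 means "unknown",
reads "normalization" as "an unknown operand yields an unknown result", and
uses a local stand-in for the saturating multiply built on
Math.multiplyExact(). The class name and helpers are hypothetical.

```java
final class SaturatingMultiplySketch {
  // Assumption: -1 marks an unknown cardinality; values below -1 are invalid.
  static final long UNKNOWN = -1;

  // Stand-in for a saturating multiply: clamp to the long range on overflow
  // instead of wrapping around.
  static long saturatingMultiply(long a, long b) {
    try {
      return Math.multiplyExact(a, b);
    } catch (ArithmeticException overflow) {
      return ((a < 0) == (b < 0)) ? Long.MAX_VALUE : Long.MIN_VALUE;
    }
  }

  // Hypothetical helper: reject anything that is not a valid cardinality.
  static void checkValid(long cardinality) {
    if (cardinality < UNKNOWN) {
      throw new IllegalArgumentException("invalid cardinality: " + cardinality);
    }
  }

  // Validates both operands, propagates UNKNOWN, and otherwise multiplies
  // with saturation. Both operands are non-negative by the time we multiply.
  static long saturatingMultiplyCardinalities(long a, long b) {
    checkValid(a);
    checkValid(b);
    if (a == UNKNOWN || b == UNKNOWN) return UNKNOWN;
    return saturatingMultiply(a, b);
  }

  public static void main(String[] args) {
    System.out.println(saturatingMultiplyCardinalities(1825, 186));
    System.out.println(saturatingMultiplyCardinalities(-1, 5));
    System.out.println(saturatingMultiplyCardinalities(Long.MAX_VALUE, 2));
  }
}
```

Under these assumptions, treating the sentinel as an ordinary number
(-1 * 5 = -5) would produce a value below -1, so the wrapper returns the
sentinel itself instead of multiplying it.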
Reorder the logic of the tuple-based estimation from IMPALA-13405 so that
a negative estimate is handled properly.
Testing:
- Added more preconditions in AggregationNode.java.
- Added CardinalityTest.testSmallestValidCardinality and
MathUtilTest.testSaturatingMultiplyCardinality.
- Added a test in resource-requirements.test that consistently fails
without this fix.
- Passed testResourceRequirement.
Change-Id: Ib862a010b2094daa2cbdd5d555e46443009672ad
Reviewed-on: http://gerrit.cloudera.org:8080/22235
Reviewed-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Jason Fehr <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Lower AggregationNode cardinality by analyzing estimate of source Tuple
> -----------------------------------------------------------------------
>
> Key: IMPALA-13405
> URL: https://issues.apache.org/jira/browse/IMPALA-13405
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 4.4.0
> Reporter: Riza Suminto
> Assignee: Riza Suminto
> Priority: Major
> Fix For: Impala 4.5.0
>
>
> If an aggregation node has multiple grouping expressions that originate from
> the same tuple, then their combined NDV must not exceed the output cardinality
> of the PlanNode producing that tuple. Take, for example, this PARALLELPLANS
> excerpt from
> [Q31|https://github.com/apache/impala/blob/101e10b/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test].
>
> {code:java}
> | 11:AGGREGATE [STREAMING]
> | | output: sum(ss_ext_sales_price)
> | | group by: ca_county, d_qoy, d_year
> | | mem-estimate=84.55MB mem-reservation=34.00MB spill-buffer=2.00MB thread-reservation=0
> | | tuple-ids=8 row-size=50B cardinality=1.43M cost=1948896250
> | | in pipelines: 06(GETNEXT)
> ....
> | | 07:SCAN HDFS [tpcds_partitioned_parquet_snap.date_dim, RANDOM]
> | | HDFS partitions=1/1 files=1 size=2.17MB
> | | predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> | | stored statistics:
> | | table: rows=73.05K size=2.17MB
> | | columns: all
> | | extrapolated-rows=disabled max-scan-range-rows=73.05K
> | | parquet statistics predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> | | parquet dictionary predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT)
> | | mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=0
> | | tuple-ids=6 row-size=12B cardinality=186 cost=16728
> | | in pipelines: 07(GETNEXT) {code}
>
> The cardinality estimate of 11:AGGREGATE comes from this calculation:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * NDV(d_qoy) * NDV(d_year)
>                         = 1825 * 4 * 196
>                         = 1430800{code}
> However, d_qoy and d_year both belong to TupleId 6, which is produced by
> 07:SCAN, so the estimate can be lowered to this:
> {code:java}
> est_cardinality(11:AGG) = NDV(ca_county) * est_cardinality(07:SCAN)
>                         = 1825 * 186
>                         = 339450{code}
>
>
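The two calculations in the quoted description can be reproduced with a
small sketch. This is an illustration of the idea, not the planner code. The
class, the GroupingExpr type and the method names are hypothetical. The
tuple id used for ca_county is a placeholder, because the plan excerpt does
not show the node that produces it, and for the same reason no cap is
applied to that tuple.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TupleCappedNdvSketch {
  // Placeholder only: the excerpt does not show ca_county's tuple id.
  static final int HYPOTHETICAL_CA_COUNTY_TUPLE_ID = 99;

  /** A grouping expression with its NDV and the tuple it originates from. */
  static final class GroupingExpr {
    final String name;
    final long ndv;
    final int tupleId;

    GroupingExpr(String name, long ndv, int tupleId) {
      this.name = name;
      this.ndv = ndv;
      this.tupleId = tupleId;
    }
  }

  // Original estimate: the product of all grouping-expression NDVs.
  static long naiveEstimate(List<GroupingExpr> exprs) {
    long product = 1;
    for (GroupingExpr e : exprs) product = Math.multiplyExact(product, e.ndv);
    return product;
  }

  // Tuple-based estimate: multiply NDVs per source tuple, cap each per-tuple
  // product at the cardinality of the node producing that tuple (when it is
  // known), then multiply the capped values together.
  static long tupleCappedEstimate(
      List<GroupingExpr> exprs, Map<Integer, Long> tupleCardinality) {
    Map<Integer, Long> ndvPerTuple = new LinkedHashMap<>();
    for (GroupingExpr e : exprs) {
      ndvPerTuple.merge(e.tupleId, e.ndv, (x, y) -> Math.multiplyExact(x, y));
    }
    long product = 1;
    for (Map.Entry<Integer, Long> entry : ndvPerTuple.entrySet()) {
      long ndv = entry.getValue();
      Long cap = tupleCardinality.get(entry.getKey());
      if (cap != null) ndv = Math.min(ndv, cap);
      product = Math.multiplyExact(product, ndv);
    }
    return product;
  }

  // NDVs and the tuple-6 membership are taken from the description above.
  static List<GroupingExpr> q31GroupingExprs() {
    return List.of(
        new GroupingExpr("ca_county", 1825, HYPOTHETICAL_CA_COUNTY_TUPLE_ID),
        new GroupingExpr("d_qoy", 4, 6),
        new GroupingExpr("d_year", 196, 6));
  }

  public static void main(String[] args) {
    List<GroupingExpr> exprs = q31GroupingExprs();
    // 07:SCAN produces tuple 6 with cardinality=186.
    Map<Integer, Long> tupleCardinality = Map.of(6, 186L);
    System.out.println(naiveEstimate(exprs));                          // 1430800
    System.out.println(tupleCappedEstimate(exprs, tupleCardinality));  // 339450
  }
}
```

For tuple 6 the NDV product is 4 * 196 = 784, which exceeds the 186 rows
that 07:SCAN is estimated to produce, so the cap applies and the result
matches the lower figure above.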