[
https://issues.apache.org/jira/browse/IMPALA-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914377#comment-17914377
]
ASF subversion and git services commented on IMPALA-13644:
----------------------------------------------------------
Commit 3118e41c26f730a06d42994e678cab694c787649 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3118e41c2 ]
IMPALA-2945: Account for duplicate keys on multiple nodes preAgg
AggregationNode.computeStats() estimates cardinality under a single-node
assumption. This can underestimate the cardinality of a preaggregation
node, because the same grouping key may exist on multiple nodes during
preaggregation.
This patch adjusts the cardinality estimate using the following model for
the number of distinct values in a random sample of k rows. The same model
was previously used to calculate the ProcessingCost model in IMPALA-12657
and IMPALA-13644.
Assume we are picking k rows from an infinite population with NDV distinct
values, with the values uniformly distributed. The probability of a given
value not appearing in the sample is then:
((NDV - 1) / NDV) ^ k
This is because we are making k independent choices, and each of them has
a (NDV - 1) / NDV chance of not being our value. Therefore the
probability of a given value appearing in the sample is:
1 - ((NDV - 1) / NDV) ^ k
And the number of distinct values in the sample is:
(1 - ((NDV - 1) / NDV) ^ k) * NDV
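
As a rough illustration of the model above (not Impala's actual Java
implementation), the formula can be sketched in Python. The function name
expected_distinct_in_sample and the example numbers (1000 distinct keys,
10 instances, 1000 rows each) are hypothetical, and the assumption that
rows are spread evenly across instances is made only for the example.

```python
import math


def expected_distinct_in_sample(ndv: float, k: float) -> float:
    """Expected distinct values in k uniform draws: (1 - ((NDV - 1) / NDV) ^ k) * NDV."""
    if ndv <= 0 or k <= 0:
        return 0.0
    if ndv <= 1:
        # With at most one distinct value, any non-empty sample contains it.
        return float(ndv)
    # ((NDV - 1) / NDV) ^ k == exp(k * log(1 - 1 / NDV)).
    # log1p/expm1 avoid precision loss when NDV is very large.
    absent_probability_minus_one = math.expm1(k * math.log1p(-1.0 / ndv))
    return -absent_probability_minus_one * ndv


# Hypothetical scenario: 1000 distinct grouping keys overall, and each of
# 10 preaggregation instances processes 1000 input rows.
global_ndv = 1000
num_instances = 10
rows_per_instance = 1000

per_instance = expected_distinct_in_sample(global_ndv, rows_per_instance)
print(f"per-instance distinct keys: {per_instance:.1f}")
print(f"preagg output across instances: {per_instance * num_instances:.1f}")
print(f"single-node assumption: {global_ndv}")
```

In this made-up scenario the summed preaggregation output comes out several
times larger than the single-node figure, which is the kind of
underestimation the patch is meant to correct.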
The query option ESTIMATE_DUPLICATE_IN_PREAGG is added to control whether
the new estimation logic is used.
Testing:
- Pass core tests.
Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Reviewed-on: http://gerrit.cloudera.org:8080/22047
Reviewed-by: Riza Suminto <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Generalize and move getPerInstanceNdvForCpuCosting into AggregationNode.
> ------------------------------------------------------------------------
>
> Key: IMPALA-13644
> URL: https://issues.apache.org/jira/browse/IMPALA-13644
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 4.4.0
> Reporter: Riza Suminto
> Assignee: Riza Suminto
> Priority: Major
> Fix For: Impala 4.5.0
>
>
> getPerInstanceNdvForCpuCosting is a method that estimates the number of
> distinct values of exprs per fragment instance, accounting for the
> likelihood of duplicate keys across fragment instances. It borrows the
> probabilistic model from the formula described in IMPALA-2945. This method
> is used only by AggregationNode.
> [https://github.com/apache/impala/blob/99529db6ad62ddc34cbfd924d7e41b1fce5b60a2/fe/src/main/java/org/apache/impala/planner/PlanFragment.java#L634-L642]
>
> We should move this method to AggregationNode and generalize it to accept
> the NDV estimate calculated in AggregationNode.computeStats() as input.
> The number from computeStats should be more precise now, after the
> improvements from IMPALA-13405, IMPALA-13526, and IMPALA-13622.