Paul Rogers created IMPALA-7604:
-----------------------------------
Summary: In AggregationNode.computeStats, handle cardinality
overflow better
Key: IMPALA-7604
URL: https://issues.apache.org/jira/browse/IMPALA-7604
Project: IMPALA
Issue Type: Improvement
Affects Versions: Impala 2.12.0
Reporter: Paul Rogers
Consider the cardinality overflow logic in
[{{AggregationNode.computeStats()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/AggregationNode.java].
Current code:
{noformat}
// if we ended up with an overflow, the estimate is certain to be wrong
if (cardinality_ < 0) cardinality_ = -1;
{noformat}
This code has a number of issues.
* The check is done after looping over all conjuncts. By then the value may
have overflowed more than once and wrapped back to a non-negative number,
which the {{< 0}} test misses. The check should be done after each
multiplication.
* Since we know that the number overflowed, a better estimate of the total
count is {{Long.MAX_VALUE}}.
* The code later checks for the -1 value and, if found, uses the cardinality of
the first child. This is a worse estimate than using the max value, since the
first child might have a low cardinality (it could be the later children that
caused the overflow).
* If we really do expect overflow, then we are dealing with very large numbers.
Being accurate to the row is not needed. Better to use a {{double}} which can
handle the large values.
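A minimal sketch of the per-multiplication check suggested above. This is not the existing Impala code: the class {{CardinalityMath}} and its methods are hypothetical names used only for illustration, and the treatment of -1 as "unknown" is an assumption taken from the snippet quoted earlier. The only library call used is the JDK's {{Math.multiplyExact(long, long)}}.

```java
// Hypothetical helper, not part of Impala: saturating multiplication for
// cardinality estimates, checked after every multiplication.
final class CardinalityMath {
  private CardinalityMath() {}

  /**
   * Multiplies two cardinalities. A negative input is treated as "unknown"
   * (assumption) and yields -1. On overflow the result saturates at
   * Long.MAX_VALUE instead of wrapping.
   */
  static long multiplyCardinalities(long a, long b) {
    if (a < 0 || b < 0) return -1;
    try {
      // Throws ArithmeticException if the product does not fit in a long.
      return Math.multiplyExact(a, b);
    } catch (ArithmeticException e) {
      return Long.MAX_VALUE;
    }
  }

  /** Multiplies all factors, applying the overflow check at each step. */
  static long product(long[] factors) {
    long result = 1;
    for (long f : factors) {
      result = multiplyCardinalities(result, f);
      if (result < 0) return -1; // unknown input encountered
    }
    return result;
  }

  public static void main(String[] args) {
    long big = 1L << 32;
    // Plain multiplication wraps: 2^32 * 2^32 = 2^64, which is 0 as a long,
    // so a "< 0" test after the fact would not notice the overflow.
    System.out.println(big * big);
    // The checked version saturates instead.
    System.out.println(multiplyCardinalities(big, big) == Long.MAX_VALUE);
  }
}
```

The {{big * big}} case shows why a single {{< 0}} test at the end is not enough: the wrapped result is 0, not a negative number. Switching the accumulator to a {{double}}, as the last bullet proposes, would avoid the wrap entirely at the cost of row-level precision.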
Since overflow probably occurs seldom, this is not an urgent issue. Still, if
overflow does occur, the query is huge, and having at least some estimate of
the hugeness is better than none. Also, it seems that this code evolved over
time; this newbie is looking at it fresh and sees that the accumulated fixes
could be tidied up.