[
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626483#comment-16626483
]
Paul Rogers edited comment on IMPALA-7310 at 9/26/18 7:43 PM:
--------------------------------------------------------------
Final solution is even simpler, since we don't want to change anything except
handling of a table with stats, and a column of all nulls. So we apply the
adjustment only if:
* The column comes from a base table (not an internal column such as
{{COUNT(*)}}
* The table has stats
* The column in question is nullable
* The column either has no null count, or the null count is non-zero
* The NDV without nulls is 0 or 1
In this very limited case, we bump the NDV by 1.
As it turns out, the TPC-H test cases have several queries in which a table
column is nullable, has only one or two values, and those columns are clearly
meant to be non-null. The above fix works around these cases so that such cases
don't cause large changes to the results for {{PlannerTest}}.
was (Author: paul.rogers):
Final solution is even simpler, since we don't want to change anything except
handling of a table with stats, and a column of all nulls. If a table has
stats, and if NDV <= 10, then add one to it to account for the fact that the
planner defines NDV to include null values, while the NDV function (and thus
stats) excludes null from the NDV.
Above 10, the difference of including null becomes minor, so to minimize the
impact of the change, we phase out the adjustments once NDV gets above 10.
> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -----------------------------------------------------------------------------
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0,
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
> Reporter: Zsombor Fedor
> Assignee: Paul Rogers
> Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to
> determine columns NDVs.
>
> For example when aggregating more columns, the estimated cardinality is
> [counted as the product of the columns' number of distinct
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
> If there is a column full of NULLs the whole product will be 0.
>
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>
>
> Or we should change the planner
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
> to take care of this bug.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]