[
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621434#comment-16621434
]
Paul Rogers edited comment on IMPALA-7310 at 9/20/18 3:02 AM:
--------------------------------------------------------------
Odd. Looked at the tests in {{ExprNdvTest}}. We have tests such as:
{noformat}
verifyNdv("case when id = 1 then 'yes' else 'no' end", 2);
{noformat}
This says that we convert the range of values in column {{id}} into two values:
{{'yes'}} and {{'no'}}. So, the NDV of the expression is 2. Solid reasoning.
Then I added a new test using only constants:
{noformat}
verifyNdv("case when 0 = 1 then 'yes' else 'no' end", 2);
{noformat}
In this case, the NDV is reported as 2 when it should be 1, because the
expression is constant.
Yes, this is a nit and few users will create such SQL. Still, I wonder if this
suggests other inconsistencies in NDV & selectivity handling?
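To make the expected numbers concrete, here is a minimal sketch of the counting rule I have in mind. It is not Impala's code: the class {{CaseNdvSketch}}, the {{Cond}} enum and the {{caseNdv}} method are all hypothetical names, and the sketch assumes every branch result is a string literal and that an ELSE clause is always present.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not Impala code: NDV of a CASE whose results are literals.
final class CaseNdvSketch {
  /** Assumed summary of a WHEN condition after constant folding. */
  enum Cond { ALWAYS_TRUE, ALWAYS_FALSE, NON_CONSTANT }

  /**
   * Counts the distinct literal results that can actually be produced.
   * conds[i] guards results[i]; elseResult is used when no branch matches.
   */
  static long caseNdv(Cond[] conds, String[] results, String elseResult) {
    Set<String> reachable = new HashSet<>();
    for (int i = 0; i < conds.length; i++) {
      if (conds[i] == Cond.ALWAYS_FALSE) continue;  // branch can never fire
      reachable.add(results[i]);
      // A constant-true branch hides every later branch and the ELSE.
      if (conds[i] == Cond.ALWAYS_TRUE) return reachable.size();
    }
    reachable.add(elseResult);
    return reachable.size();
  }

  public static void main(String[] args) {
    // case when id = 1 then 'yes' else 'no' end
    System.out.println(caseNdv(
        new Cond[] {Cond.NON_CONSTANT}, new String[] {"yes"}, "no"));  // 2
    // case when 0 = 1 then 'yes' else 'no' end
    System.out.println(caseNdv(
        new Cond[] {Cond.ALWAYS_FALSE}, new String[] {"yes"}, "no"));  // 1
  }
}
```

Under these assumptions the column-based test still yields 2, while the all-constant test yields 1.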
Also, we seem to overuse "undefined":
{noformat}
verifyNdvTwoTable("case when id = 1 then date_string_col else tiny.b end",
-1);
{noformat}
This says that, depending on column {{id}} (which has stats), use either
{{date_string_col}} (which has stats) or {{tiny.b}}, which does not. The result
is no NDV.
But, we know that the NDV is at least 736 (the value for {{date_string_col}}).
So, choosing -1 is throwing away data.
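One way to keep that information, sketched below, is to fall back to the largest NDV among the branches that do have stats. This follows the lower-bound reasoning above and is only an illustration: {{BranchNdvSketch}}, {{UNKNOWN}} and {{lowerBoundNdv}} are hypothetical names, not the planner's API, and the result is a floor rather than a full estimate.

```java
// Hypothetical sketch, not Impala code: combine branch NDVs when some are unknown.
final class BranchNdvSketch {
  /** Marker for "no stats available", matching the -1 used in the tests. */
  static final long UNKNOWN = -1;

  /**
   * Returns the largest known branch NDV as a lower bound,
   * or UNKNOWN only when no branch has stats at all.
   */
  static long lowerBoundNdv(long... branchNdvs) {
    long best = UNKNOWN;
    for (long ndv : branchNdvs) {
      if (ndv >= 0) {
        best = Math.max(best, ndv);
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // date_string_col has NDV 736; tiny.b has no stats.
    System.out.println(lowerBoundNdv(736, UNKNOWN));  // prints 736
  }
}
```

With this rule the two-table test would expect 736 rather than -1, and -1 would be reserved for the case where no branch has stats.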
> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -----------------------------------------------------------------------------
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0,
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
> Reporter: Zsombor Fedor
> Assignee: Paul Rogers
> Priority: Major
>
> As in other DBMSs,
> {code:java}
> NDV(col){code}
> does not count NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can
> cause some anomalies (e.g. bad join order) as compute stats uses NDV() to
> determine columns' NDVs.
>
> For example when aggregating more columns, the estimated cardinality is
> [counted as the product of the columns' number of distinct
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
> If there is a column full of NULLs, the whole product will be 0.
>
> There are two possible fixes for this.
> Either we should count NULLs as a distinct value when Computing Stats in the
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>
>
> Or we should change the planner
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
> to take care of this bug.
>