[jira] [Comment Edited] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

Paul Rogers (JIRA) Wed, 19 Sep 2018 18:52:03 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621096#comment-16621096
 ]


Paul Rogers edited comment on IMPALA-7310 at 9/20/18 1:50 AM:
--------------------------------------------------------------

Simplest case: a binary predicate. Current behavior in 
{{BinaryPredicate.analyzeImpl()}}:

{noformat}
      long distinctValues = slotRefRef.getRef().getNumDistinctValues();
      if (distinctValues > 0) {
        selectivity_ = 1.0 / distinctValues;
        selectivity_ = Math.max(0, Math.min(1, selectivity_));
      }
{noformat}

So, if NDV is 0 (all nulls), selectivity is left at its default (-1, which 
seems to mean "undefined".)

There are really two cases here:

1. We have stats. The 0 for NDV tells us that we either have 0 rows, or all 
rows are null. In this case, we can safely assume that the NDV-with-nulls is 1.
2. We have no stats. NDV is unknown. But, we can use our default estimate in 
{{Expr.DEFAULT_SELECTIVITY = 0.1}} to compute an a-priority NDV estimate of 
1/selectivity = 10. (Or, more directly, use {{DEFAULT_SELECTIVITY}} as the 
expression selectivity.)

To even consider such a change, we'd need to test the various paths. The only 
test I could find for expressions and NDV are {{ExprNdvTest}} which tests with 
and without stats, but not with an NDV=0 (as far as I can tell.)

Note that this test case uses a different definition for NDV than the 
"distinct-non-null-values" definition used by the {{NDV\(x)}} function:

{noformat}
    // When else not specified, it is NULL, verify it is counted
    verifyNdv("case when id = 1 then 'yes' end", 2);
{noformat}

And, in {{Predicate.analyzeImpl()}}:

{noformat}
    // values: true/false/null
    numDistinctValues_ = 3;
{noformat}



was (Author: paul.rogers):
Simplest case: a binary predicate. Current behavior in 
{{BinaryPredicate.analyzeImpl()}}:

{noformat}
      long distinctValues = slotRefRef.getRef().getNumDistinctValues();
      if (distinctValues > 0) {
        selectivity_ = 1.0 / distinctValues;
        selectivity_ = Math.max(0, Math.min(1, selectivity_));
      }
{noformat}

So, if NDV is 0 (all nulls), selectivity is left at its default (-1, which 
seems to mean "undefined".)

There are really two cases here:

1. We have stats. The 0 for NDV tells us that we either have 0 rows, or all 
rows are null. In this case, we can safely assume that the NDV-with-nulls is 1.
2. We have no stats. NDV is unknown. But, we can use our default estimate in 
{{Expr.DEFAULT_SELECTIVITY = 0.1}} to compute an a-priority NDV estimate of 
1/selectivity = 10. (Or, more directly, use {{DEFAULT_SELECTIVITY}} as the 
expression selectivity.)

To even consider such a change, we'd need to test the various paths. The only 
test I could find for expressions and NDV are {{ExprNdvTest}} which tests with 
and without stats, but not with an NDV=0 (as far as I can tell.)

Note that this test case uses a different definition for NDV than the 
"distinct-non-null-values" definition used by the {{NDV\(x)}} function:

{noformat}
    // When else not specified, it is NULL, verify it is counted
    verifyNdv("case when id = 1 then 'yes' end", 2);
{noformat}

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -----------------------------------------------------------------------------
>
>                 Key: IMPALA-7310
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7310
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>            Reporter: Zsombor Fedor
>            Assignee: Paul Rogers
>            Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

Reply via email to