[jira] [Comment Edited] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

Paul Rogers (JIRA) Thu, 20 Sep 2018 16:38:09 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622856#comment-16622856
 ]


Paul Rogers edited comment on IMPALA-7310 at 9/20/18 11:36 PM:
---------------------------------------------------------------

The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on 
the build side of a join, or to I put it on the probe site?) In most cases, the 
values being compared are order-of-magnitude different, and so fine nuances of 
value are not important. We simply need some reasonable non-zero number so that 
the calcs can play out.

The simplest fix is to handle the non-stats case for a 
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
 instance.

The current code in {{initColStats()}} initializes NDV to -1 (undefined). 
Suggested alternatives:
 * If type is Boolean, NDV = 2
 * If type is TINYINT, NDV = 256.
 * If type is anything else, assume NDV = some constant, say 1000.

Note that the above logic actually already exists in 
{{createHiveColStatsData()}} where it is used to bound the NDV value. So, we 
just reuse it. That code also suggests we can NDV at row count. But, since our 
guesses are small, the row count might not add much value. Or, our NDV guess 
might be some fraction of row count, if we want to be fancy.

{{ColumnStats}} already has a {{hasStats()}} method which checks if NDV is 
other than -1. Since NDV will always not be some value, change this method to 
check only {{numNulls_}}, which will continue to be -1 without stats.

Finally, in {{createHiveColStatsData}}, set a floor on NDV at 1 to account for 
the fact that an all-null column has HDV=0. Or, to be conservative, if NDV < 
10, add one to NDV to account for nulls.

Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}} 
for the "incompatible" case.

As a result, when the plan nodes ask for NDV, they won't get a 0 value if we 
have no data, nor will they get 0 if a column is all nulls.

Add or modify unit tests to verify the above logic, especially the defaults 
case and how the defaults propagate up the plan tree.

The risk is that some plans will change. We hope they change to favor getting 
the correct plan more often. But, there will be some use case for which the 
old, wrong, values produced a more accurate plan than the new estimates. This 
is always a risk.


was (Author: paul.rogers):
The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on 
the build side of a join, or to I put it on the probe site?) In most cases, the 
values being compared are order-of-magnitude different, and so fine nuances of 
value are not important. We simply need some reasonable non-zero number so that 
the calcs can play out.

The simplest fix is to handle the non-stats case for a 
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
 instance.

The current code in {{initColStats()}} initializes NDV to -1 (undefined). 
Suggested alternatives:

 * If type is Boolean, NDV = 2
 * If type is TINYINT, NDV = 256.
 * If type is anything else, assume NDV = some constant, say 1000.

Note that the above logic actually already exists in 
{{createHiveColStatsData()}} where it is used to bound the NDV value. So, we 
just reuse it. That code also suggests we can NDV at row count. But, since our 
guesses are small, the row count might not add much value. Or, our NDV guess 
might be some fraction of row count, if we want to be fancy.

Then, add a flag (if some useful field does not exist) to indicate if the NDV 
is from stats or estimates. (Will be useful in computing selectivity later.)

Finally, in {{createHiveColStatsData}}, set a floor on NDV at 1 to account for 
the fact that an all-null column has HDV=0. Or, to be conservative, if NDV < 
10, add one to NDV to account for nulls.

Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}} 
for the "incompatible" case.

As a result, when the plan nodes ask for NDV, they won't get a 0 value if we 
have no data, nor will they get 0 if a column is all nulls.

Add or modify unit tests to verify the above logic, especially the defaults 
case and how the defaults propagate up the plan tree.

The risk is that some plans will change. We hope they change to favor getting 
the correct plan more often. But, there will be some use case for which the 
old, wrong, values produced a more accurate plan than the new estimates. This 
is always a risk.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -----------------------------------------------------------------------------
>
>                 Key: IMPALA-7310
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7310
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>            Reporter: Zsombor Fedor
>            Assignee: Paul Rogers
>            Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

Reply via email to