[
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622856#comment-16622856
]
Paul Rogers edited comment on IMPALA-7310 at 9/26/18 7:45 PM:
--------------------------------------------------------------
The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on
the build side of a join, or to I put it on the probe site?) In most cases, the
values being compared are order-of-magnitude different, and so fine nuances of
value are not important. We simply need some reasonable non-zero number so that
the calcs can play out.
(Note: the following is left for the record, but the final fix is much more
narrowly tailored.)
The simplest fix is to handle the non-stats case for a
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
instance.
The current code in {{initColStats()}} initializes NDV to -1 (undefined).
Suggested alternatives:
* If type is Boolean, NDV = 2
* If type is TINYINT, NDV = 256.
* If type is anything else, assume NDV = some constant, say 1000.
Note that a variation of the above logic actually already exists in
{{createHiveColStatsData()}} where it is used to bound the NDV value. So, we
just reuse it. That code also suggests we can NDV at row count. But, since our
guesses are small, the row count might not add much value. Or, our NDV guess
might be some fraction of row count, if we want to be fancy.
{{ColumnStats}} already has a {{hasStats()}} method which checks if NDV is
other than -1. Since NDV will always not be some value, change this method to
check only {{numNulls_}}, which will continue to be -1 without stats.
Finally, in {{createHiveColStatsData}}, set a floor on NDV at 1 to account for
the fact that an all-null column has HDV=0. Or, to be conservative, if NDV <=
10, add one to NDV to account for nulls. (Do this always, since a column that
claims to be non-null can eventually become null as the result of an outer
join.)
Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}}
for the "incompatible" case.
As a result, when the plan nodes ask for NDV, they won't get a 0 value if we
have no data, nor will they get 0 if a column is all nulls.
Add or modify unit tests to verify the above logic, especially the defaults
case and how the defaults propagate up the plan tree.
The risk is that some plans will change. We hope they change to favor getting
the correct plan more often. But, there will be some use case for which the
old, wrong, values produced a more accurate plan than the new estimates. This
is always a risk.
was (Author: paul.rogers):
The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on
the build side of a join, or to I put it on the probe site?) In most cases, the
values being compared are order-of-magnitude different, and so fine nuances of
value are not important. We simply need some reasonable non-zero number so that
the calcs can play out.
The simplest fix is to handle the non-stats case for a
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
instance.
The current code in {{initColStats()}} initializes NDV to -1 (undefined).
Suggested alternatives:
* If type is Boolean, NDV = 2
* If type is TINYINT, NDV = 256.
* If type is anything else, assume NDV = some constant, say 1000.
Note that a variation of the above logic actually already exists in
{{createHiveColStatsData()}} where it is used to bound the NDV value. So, we
just reuse it. That code also suggests we can NDV at row count. But, since our
guesses are small, the row count might not add much value. Or, our NDV guess
might be some fraction of row count, if we want to be fancy.
{{ColumnStats}} already has a {{hasStats()}} method which checks if NDV is
other than -1. Since NDV will always not be some value, change this method to
check only {{numNulls_}}, which will continue to be -1 without stats.
Finally, in {{createHiveColStatsData}}, set a floor on NDV at 1 to account for
the fact that an all-null column has HDV=0. Or, to be conservative, if NDV <=
10, add one to NDV to account for nulls. (Do this always, since a column that
claims to be non-null can eventually become null as the result of an outer
join.)
Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}}
for the "incompatible" case.
As a result, when the plan nodes ask for NDV, they won't get a 0 value if we
have no data, nor will they get 0 if a column is all nulls.
Add or modify unit tests to verify the above logic, especially the defaults
case and how the defaults propagate up the plan tree.
The risk is that some plans will change. We hope they change to favor getting
the correct plan more often. But, there will be some use case for which the
old, wrong, values produced a more accurate plan than the new estimates. This
is always a risk.
> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -----------------------------------------------------------------------------
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0,
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
> Reporter: Zsombor Fedor
> Assignee: Paul Rogers
> Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to
> determine columns NDVs.
>
> For example when aggregating more columns, the estimated cardinality is
> [counted as the product of the columns' number of distinct
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
> If there is a column full of NULLs the whole product will be 0.
>
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>
>
> Or we should change the planner
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
> to take care of this bug.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]