[
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622239#comment-16622239
]
Tim Armstrong commented on IMPALA-7310:
---------------------------------------
WRT to throwing away data, the philosophy in the Impala planner is generally
not to try to hard to infer statistics if "proper" statistics are missing. E.g.
we don't try to estimate row count from file size, so the planner can make
"obviously dumb" decisions with join ordering in the absence of stats.
I think the motivation was to avoid investing too much effort (implementation
and maintenance) to handle cases where best practices weren't followed, and
also as a way to discourage bad practices.
I don't know if this is the right decision, but there's some sense to it.
> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -----------------------------------------------------------------------------
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0,
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
> Reporter: Zsombor Fedor
> Assignee: Paul Rogers
> Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to
> determine columns NDVs.
>
> For example when aggregating more columns, the estimated cardinality is
> [counted as the product of the columns' number of distinct
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
> If there is a column full of NULLs the whole product will be 0.
>
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>
>
> Or we should change the planner
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
> to take care of this bug.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]