[
https://issues.apache.org/jira/browse/IMPALA-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aman Sinha updated IMPALA-8050:
-------------------------------
Fix Version/s: Impala 4.0
> IS [NOT] NULL gives wrong selectivity when null count is missing
> ----------------------------------------------------------------
>
> Key: IMPALA-8050
> URL: https://issues.apache.org/jira/browse/IMPALA-8050
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Minor
> Fix For: Impala 4.0
>
>
> Suppose we have the following query:
> {noformat}
> select *
> from tpch.customer c
> where c.c_comment is null
> {noformat}
> If we have a null count, we can estimate selectivity based on that number. In
> the case of the TPC-H test data, after a recent fix to add null count back,
> the null count is zero, so the cardinality of the predicate {{c.c_comment is
> null}} is 0 and no rows should be returned. Yet, the query plan shows:
> {noformat}
> PLAN-ROOT SINK
> |
> 00:SCAN HDFS [tpch.customer c]
> partitions=1/1 files=1 size=23.08MB row-size=218B cardinality=15.00K
> predicates: c.c_comment IS NULL
> {noformat}
> So, the first bug is that the existing code that is supposed to consider the
> null count (found in {{IsNullPredicate.analyzeImpl()}}) does not work. The
> reason is that the check in {{ColumnStats}} for whether we have nulls is wrong:
> {code:java}
> public boolean hasNulls() { return numNulls_ > 0; }
> {code}
> Zero is a perfectly valid null count: it means a NOT NULL column. The marker
> for a missing null count is -1 as shown in another method:
> {code:java}
> public boolean hasStats() { return numNulls_ != -1 || numDistinctValues_ !=
> -1; }
> {code}
> This is probably an ambiguity in the name: does "has nulls" mean:
> * Do we have valid null count stats?
> * Do we have null count stats and we have at least some nulls?
> Fortunately, the only other use of this method is in (disabled) tests.
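> One way the ambiguity could be resolved is by splitting the check into two
> clearly-named methods. The following is a hypothetical sketch, not the shipped
> {{ColumnStats}} class; {{ColumnStatsSketch}} and {{hasNullStats()}} are
> illustrative names only:

```java
// Hypothetical sketch: split the ambiguous hasNulls() into two checks.
// As elsewhere in ColumnStats, -1 marks a missing null count, while 0 is
// a perfectly valid count meaning a NOT NULL column.
public class ColumnStatsSketch {
    private final long numNulls_;

    public ColumnStatsSketch(long numNulls) { numNulls_ = numNulls; }

    // True when a null count statistic is available at all.
    public boolean hasNullStats() { return numNulls_ != -1; }

    // True when stats exist and the column actually contains nulls.
    public boolean hasNulls() { return numNulls_ > 0; }
}
```

> With such a split, {{IsNullPredicate.analyzeImpl()}} could call
> {{hasNullStats()}} to decide whether a null count is usable at all, which
> correctly includes the zero (NOT NULL column) case.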
> h4. Handle Missing Null Counts
> Second, if the null count is not available (as with older stats), the
> next-best approximation is 1/NDV. The code currently guesses 0.1; that
> estimate is fine only when the NDV is not available either.
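> The fallback chain might look like the following sketch. The class and method
> names here are hypothetical, with -1 marking a missing statistic as in
> {{ColumnStats}}:

```java
// Hypothetical sketch of the IS NULL selectivity fallback chain:
// 1. null count available: numNulls / numRows
// 2. else NDV available:   1 / NDV
// 3. else:                 the existing default guess of 0.1
public class IsNullSelectivitySketch {
    static final double DEFAULT_SELECTIVITY = 0.1;

    static double isNullSelectivity(long numRows, long numNulls, long ndv) {
        // A null count of 0 is valid (NOT NULL column) and yields selectivity 0.
        if (numNulls >= 0 && numRows > 0) return (double) numNulls / numRows;
        if (ndv > 0) return 1.0 / ndv;
        return DEFAULT_SELECTIVITY;
    }
}
```

> On the TPC-H example above this would give a selectivity of 0 for
> {{c.c_comment is null}}, because the null count is 0 out of 150K rows.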
> Note that to properly test some of these cases requires new tables in the
> test suite with no or partial stats.
> h4. Special Consideration for Outer Joins
> When this predicate is applied to the result of an outer join, the estimation
> methods above *will not* work. Using the table null count to estimate an
> outer join null count is clearly wrong, as is using the table NDV value. The
> fallback of 0.1 will tend to underestimate the result of an outer join.
> Instead, what is needed is a more complex estimate. Assume a left outer join
> (all rows from the left, plus matching rows from the right).
> {noformat}
> |join| = |left 𝜎 key is not null| * |right|/|key| + |left 𝜎 key is null|
> {noformat}
> So, estimating {{IS NULL}} or {{IS NOT NULL}} after an outer join must use a
> different algorithm than estimating it in a scan.
> This suggests that expression selectivity is not an independent exercise as
> the code currently assumes it is. Instead, it must be aware of its context.
> In this case, the underlying null count for the column in the predicate must
> be adjusted when used in an outer join.
> The following TPC-H query gives a very clear example (see {{card-join.test}}):
> {code:sql}
> select c.c_custkey, o.o_orderkey
> from tpch.customer c
> left outer join tpch.orders o on c.c_custkey = o.o_custkey
> where o.o_clerk is null
> {code}
> The plan, with the {{IS NULL}} filter applied twice (correct structure, wrong
> cardinality estimate):
> {noformat}
> PLAN-ROOT SINK
> |
> 02:HASH JOIN [RIGHT OUTER JOIN]
> | hash predicates: o.o_custkey = c.c_custkey
> | other predicates: o.o_clerk IS NULL
> | runtime filters: RF000 <- c.c_custkey
> | row-size=51B cardinality=0
> |
> |--00:SCAN HDFS [tpch.customer c]
> | partitions=1/1 files=1 size=23.08MB row-size=8B cardinality=150.00K
> |
> 01:SCAN HDFS [tpch.orders o]
> partitions=1/1 files=1 size=162.56MB row-size=43B cardinality=1.50M
> runtime filters: RF000 -> o.o_custkey
> {noformat}
> The math:
> * The query obtains all customer rows, {{|customer| = 150K}}.
> * The query obtains all order rows where the clerk is null, which is none.
> * The query then left outer joins the customer table with orders. Since only
> 100K customers have orders, 50K do not. The result would be a join with 50K
> null clerks.
> * But, because the {{IS NULL}} calculations after the join consider only the
> {{orders}} null count, all the other rows are assumed discarded.
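> The estimate the planner should produce can be reproduced with the formula
> above. In this sketch the {{o_custkey}} NDV of 100K is an assumption for
> illustration; the other numbers come from the plan and the bullets above:

```java
// Hedged worked example of
//   |join| = |left sigma key not null| * |right| / |key| + |left sigma key is null|
// applied to the TPC-H query above (customer as the null-supplying left side).
public class OuterJoinIsNullEstimate {
    // Matched rows join |right|/|key| partners on average; unmatched left rows
    // survive the outer join padded with nulls.
    static long joinCardinality(long matchedLeft, long right, long keyNdv,
                                long unmatchedLeft) {
        return matchedLeft * (right / keyNdv) + unmatchedLeft;
    }

    // After the outer join, IS NULL on a right-side column keeps the
    // null-padded unmatched rows plus any true right-side nulls (zero here,
    // since o_clerk has a null count of 0).
    static long isNullCardinality(long unmatchedLeft, long rightNulls) {
        return unmatchedLeft + rightNulls;
    }

    public static void main(String[] args) {
        long join = joinCardinality(100_000, 1_500_000, 100_000, 50_000);
        long isNull = isNullCardinality(50_000, 0);
        System.out.println(join);   // total outer join rows
        System.out.println(isNull); // rows surviving o_clerk IS NULL
    }
}
```

> The correct estimate for the query is therefore 50K rows, not the 0 shown in
> the plan.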
> So, it may be that the earlier bug, which accidentally turned off {{IS NULL}}
> handling, actually makes this particular use case work better...
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]