[ https://issues.apache.org/jira/browse/IMPALA-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated IMPALA-8050:
-------------------------------
    Fix Version/s: Impala 4.0

> IS [NOT] NULL gives wrong selectivity when null count is missing
> ----------------------------------------------------------------
>
>                 Key: IMPALA-8050
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8050
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Minor
>             Fix For: Impala 4.0
>
>
> Suppose we have the following query:
> {noformat}
> select *
> from tpch.customer c
> where c.c_mktsegment is null
> {noformat}
> If we have a null count, we can estimate selectivity based on that number. In 
> the case of the TPC-H test data, after a recent fix to add null count back, 
> null count is zero so the cardinality of the predicate {{c.c_mktsegment is 
> null}} is 0 and no rows should be returned. Yet, the query plan shows:
> {noformat}
> PLAN-ROOT SINK
> |
> 00:SCAN HDFS [tpch.customer c]
>    partitions=1/1 files=1 size=23.08MB row-size=218B cardinality=15.00K
>    predicates: c.c_comment IS NULL
> {noformat}
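> The 15.00K estimate is simply the default 0.1 selectivity guess applied to the 
> 150K rows of {{tpch.customer}}, rather than the 0 that a zero null count 
> implies. A minimal sketch of that arithmetic (the class and method names are 
> illustrative, not Impala's actual planner API):
> {code:java}
> // Hypothetical helper; shows where the 15.00K in the plan comes from.
> class DefaultGuess {
>   // cardinality = input rows * assumed selectivity, rounded as in the plan
>   public static long estimate(long rows, double selectivity) {
>     return Math.round(rows * selectivity);
>   }
> }
> // DefaultGuess.estimate(150000, 0.1) yields 15000, i.e. the 15.00K above.
> {code}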
> So, the first bug is that the existing code that is supposed to consider the 
> null count (found in {{IsNullPredicate.analyzeImpl()}}) does not work. The 
> reason is that the check in {{ColumnStats}} for whether we have nulls is wrong:
> {code:java}
>   public boolean hasNulls() { return numNulls_ > 0; }
> {code}
> Zero is a perfectly valid null count: it means a NOT NULL column. The marker 
> for a missing null count is -1 as shown in another method:
> {code:java}
>   public boolean hasStats() { return numNulls_ != -1 || numDistinctValues_ != -1; }
> {code}
> This is probably due to an ambiguity in the name: does "has nulls" mean:
> * Do we have valid null count stats?
> * Do we have null count stats and we have at least some nulls?
> Fortunately, the only other use of this method is in (disabled) tests.
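> One way to resolve the ambiguity is to keep the two questions as separate 
> methods. A hedged sketch (the names are hypothetical, not the actual 
> {{ColumnStats}} API):
> {code:java}
> // Hypothetical sketch: -1 marks a missing statistic, as in hasStats().
> class ColumnStatsSketch {
>   private final long numNulls_;
>
>   public ColumnStatsSketch(long numNulls) { numNulls_ = numNulls; }
>
>   // "Do we have a valid null count?" Zero is valid: a NOT NULL column.
>   public boolean hasNullsStats() { return numNulls_ != -1; }
>
>   // "Do we know there is at least one null?" Meaningful only with stats.
>   public boolean hasNulls() { return hasNullsStats() && numNulls_ > 0; }
> }
> {code}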
> h4. Handle Missing Null Counts
> Second, if the null count is not available (as with older stats), the next-best 
> approximation is 1/NDV. The code currently guesses 0.1; that estimate is fine 
> only if the NDV is not available either.
> Note that to properly test some of these cases requires new tables in the 
> test suite with no or partial stats.
> h4. Special Consideration for Outer Joins
> When this predicate is applied to the result of an outer join, the estimation 
> methods above *will not* work. Using the table null count to estimate an 
> outer join null count is clearly wrong, as is using the table NDV value. The 
> fall-back of 0.1 will tend to under-estimate the null fraction produced by an 
> outer join.
> Instead, what is needed is a more complex estimate. Assume a left outer join 
> (all rows from left, plus matching rows from right.)
> {noformat}
> |join| = |left 𝜎 key is not null| * |right|/|key| + |left 𝜎 key is null|
> {noformat}
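> Reading {{|key|}} as the NDV of the join key on the right side (an assumption; 
> the formula does not spell it out), the estimate can be computed directly. The 
> sketch below uses the TPC-H numbers from the example that follows: 150K 
> customers, 1.5M orders, an assumed custkey NDV of 150K, and no null customer 
> keys.
> {code:java}
> // Hypothetical sketch of the outer-join estimate above.
> class OuterJoinEstimate {
>   public static long estimate(long leftNotNullKey, long leftNullKey,
>                               long rightRows, long keyNdv) {
>     return leftNotNullKey * rightRows / keyNdv + leftNullKey;
>   }
> }
> // estimate(150000, 0, 1500000, 150000) = 1500000: each matching left row
> // expands to |right|/|key| rows, and null-key left rows pass through once.
> {code}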
> So, estimating {{IS NULL}} or {{IS NOT NULL}} after an outer join must use a 
> different algorithm than when estimating it in a scan.
> This suggests that expression selectivity is not the independent exercise the 
> code currently assumes it is. Instead, it must be aware of its context. 
> In this case, the underlying null count for the column in the predicate must 
> be adjusted when used in an outer join.
> The following TPC-H query gives a very clear example (see {{card-join.test}}):
> {code:sql}
> select c.c_custkey, o.o_orderkey
> from tpch.customer c
> left outer join tpch.orders o on c.c_custkey = o.o_custkey
> where o.o_clerk is null
> {code}
> The plan, with the {{IS NULL}} filter applied twice (correct structure, wrong 
> cardinality estimate):
> {noformat}
> PLAN-ROOT SINK
> |
> 02:HASH JOIN [RIGHT OUTER JOIN]
> |  hash predicates: o.o_custkey = c.c_custkey
> |  other predicates: o.o_clerk IS NULL
> |  runtime filters: RF000 <- c.c_custkey
> |  row-size=51B cardinality=0
> |
> |--00:SCAN HDFS [tpch.customer c]
> |     partitions=1/1 files=1 size=23.08MB row-size=8B cardinality=150.00K
> |
> 01:SCAN HDFS [tpch.orders o]
>    partitions=1/1 files=1 size=162.56MB row-size=43B cardinality=1.50M
>    runtime filters: RF000 -> o.o_custkey
> {noformat}
> The math:
> * The query obtains all customer rows, {{|customer| = 150K}}.
> * The query obtains all order rows where the clerk is null, which is none.
> * The query then left outer joins the customer table with orders. Since only 
> 100K customers have orders, 50K do not. The result would be a join with 50K 
> null clerks.
> * But, because the {{IS NULL}} calculations after the join consider only the 
> {{orders}} null count, all the other rows are assumed discarded.
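> The arithmetic in the bullets above, written out (the row counts are the ones 
> stated in the text; the helper itself is illustrative):
> {code:java}
> // Expected IS NULL cardinality after the left outer join.
> class CardExample {
>   public static long nullClerkRows(long customers, long customersWithOrders,
>                                    long ordersWithNullClerk) {
>     // Customers without orders get a null-extended orders side, so their
>     // o_clerk is NULL and they survive the IS NULL filter.
>     return (customers - customersWithOrders) + ordersWithNullClerk;
>   }
> }
> // nullClerkRows(150000, 100000, 0) = 50000, not the 0 shown in the plan.
> {code}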
> So, it may be that the earlier bug, which accidentally turned off {{IS NULL}} 
> handling, actually makes this particular use case work better...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
