[
https://issues.apache.org/jira/browse/IMPALA-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated IMPALA-8050:
--------------------------------
Description:
Suppose we have the following query:
{noformat}
select *
from tpch.customer c
where c.c_mktsegment is null
{noformat}
If we have a null count, we can estimate selectivity based on that number. In
the case of the TPC-H test data, after a recent fix to add the null count back,
the null count is zero, so the cardinality of the predicate {{c.c_mktsegment is
null}} is 0 and no rows should be returned. Yet the query plan shows:
{noformat}
PLAN-ROOT SINK
|
00:SCAN HDFS [tpch.customer c]
partitions=1/1 files=1 size=23.08MB row-size=218B cardinality=15.00K
predicates: c.c_comment IS NULL
{noformat}
So, the first bug is that the existing code that is supposed to consider the
null count (found in {{IsNullPredicate.analyzeImpl()}}) does not work. The
reason is that the check in {{ColumnStats}} for whether we have nulls is wrong:
{code:java}
public boolean hasNulls() { return numNulls_ > 0; }
{code}
Zero is a perfectly valid null count: it means a NOT NULL column. The marker
for a missing null count is -1 as shown in another method:
{code:java}
public boolean hasStats() { return numNulls_ != -1 || numDistinctValues_ != -1; }
{code}
This is probably an ambiguity in the name: does "has nulls" mean:
* Do we have valid null count stats?
* Do we have null count stats and we have at least some nulls?
Fortunately, the only other use of this method is in (disabled) tests.
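One way to resolve the ambiguity (a sketch only, not the real {{ColumnStats}} class; {{hasNullCountStats()}} is a hypothetical name) is to split the two meanings into separate methods:
{code:java}
// Sketch: split the ambiguous hasNulls() into two methods with distinct meanings.
public class ColumnStatsSketch {
    // -1 means "no null-count statistic"; 0 is a valid count (a NOT NULL column).
    private final long numNulls_;

    public ColumnStatsSketch(long numNulls) { numNulls_ = numNulls; }

    // Do we have a valid null-count statistic at all?
    public boolean hasNullCountStats() { return numNulls_ != -1; }

    // Do we know the count, and is it positive?
    public boolean hasNulls() { return numNulls_ > 0; }

    public static void main(String[] args) {
        ColumnStatsSketch notNullCol = new ColumnStatsSketch(0);
        System.out.println(notNullCol.hasNullCountStats()); // true: stats exist
        System.out.println(notNullCol.hasNulls());          // false: no nulls
    }
}
{code}
With this split, the {{IS NULL}} selectivity code would test {{hasNullCountStats()}}, so a zero count is used rather than ignored.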
h4. Handle Missing Null Counts
Second, if the null count is not available (as with older stats), the next-best
approximation is 1/NDV. The code currently guesses 0.1; that estimate is fine
only if the NDV is not available either.
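The proposed fallback chain could be sketched as follows (names and structure are hypothetical, not the actual Impala code):
{code:java}
// Sketch of the proposed selectivity chain for "col IS NULL".
// A value of -1 means "statistic not available".
public class IsNullSelectivitySketch {
    static double isNullSelectivity(long numNulls, long ndv, long rowCount) {
        if (numNulls >= 0 && rowCount > 0) {
            return (double) numNulls / rowCount; // best: exact null fraction
        }
        if (ndv > 0) {
            return 1.0 / ndv;                    // next best: 1/NDV approximation
        }
        return 0.1;                              // last resort: historical guess
    }

    public static void main(String[] args) {
        System.out.println(isNullSelectivity(0, 5, 150000));   // 0.0
        System.out.println(isNullSelectivity(-1, 5, 150000));  // 0.2
        System.out.println(isNullSelectivity(-1, -1, 150000)); // 0.1
    }
}
{code}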
Note that to properly test some of these cases requires new tables in the test
suite with no or partial stats.
h4. Special Consideration for Outer Joins
When this predicate is applied to the result of an outer join, the estimation
methods above *will not* work. Using the table's null count to estimate the
outer join's null count is clearly wrong, as is using the table's NDV value.
The fallback of 0.1 will tend to underestimate the result of an outer join.
Instead, what is needed is a more complex estimate. Assume a left outer join
(all rows from the left input, plus matching rows from the right):
{noformat}
|join| = |left 𝜎 key is not null| * |right|/|key| + |left 𝜎 key is null|
{noformat}
So, estimating {{IS NULL}} or {{IS NOT NULL}} after an outer join must use a
different algorithm than estimating it in a scan.
This suggests that expression selectivity is not an independent exercise as the
code currently assumes it is. Instead, it must be aware of its context. In this
case, the underlying null count for the column in the predicate must be
adjusted when used in an outer join.
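As a worked sketch of the formula above, using the TPC-H numbers from the example below (customer keys are never null; which {{|key|}} to use is itself a choice, and here we assume the NDV of {{c_custkey}}, 150K):
{code:java}
// Worked sketch of the outer-join cardinality formula above.
public class OuterJoinEstimateSketch {
    // |join| = |left sigma key not null| * |right| / |key| + |left sigma key null|
    static double leftOuterJoinCard(double leftNotNull, double leftNull,
                                    double right, double keyNdv) {
        return leftNotNull * right / keyNdv + leftNull;
    }

    public static void main(String[] args) {
        // customer: 150K rows, c_custkey never null; orders: 1.5M rows;
        // assumed NDV(custkey) = 150K.
        double card = leftOuterJoinCard(150_000, 0, 1_500_000, 150_000);
        System.out.println(card); // 1500000.0
    }
}
{code}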
The following TPC-H query gives a very clear example (see {{card-join.test}}):
{code:sql}
select c.c_custkey, o.o_orderkey
from tpch.customer c
left outer join tpch.orders o on c.c_custkey = o.o_custkey
where o.o_clerk is null
{code}
The plan, with the {{IS NULL}} filter applied twice (correct structure, wrong
cardinality estimate):
{noformat}
PLAN-ROOT SINK
|
02:HASH JOIN [RIGHT OUTER JOIN]
| hash predicates: o.o_custkey = c.c_custkey
| other predicates: o.o_clerk IS NULL
| runtime filters: RF000 <- c.c_custkey
| row-size=51B cardinality=0
|
|--00:SCAN HDFS [tpch.customer c]
| partitions=1/1 files=1 size=23.08MB row-size=8B cardinality=150.00K
|
01:SCAN HDFS [tpch.orders o]
partitions=1/1 files=1 size=162.56MB row-size=43B cardinality=1.50M
runtime filters: RF000 -> o.o_custkey
{noformat}
The math:
* The query obtains all customer rows, {{|customer| = 150K}}.
* The query obtains all order rows where the clerk is null, which is none.
* The query then left outer joins the customer table with orders. Since only
100K customers have orders, 50K do not. The result would be a join with 50K
null clerks.
* But, because the {{IS NULL}} calculations after the join consider only the
{{orders}} null count, all the other rows are assumed to be discarded.
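The arithmetic in the bullets above can be sketched in a few lines:
{code:java}
// Sketch of the cardinality the planner should have produced for the
// example query, following the bullet-point math above.
public class NullClerkCardSketch {
    static long nullClerkRows(long customers, long customersWithOrders,
                              long ordersWithNullClerk) {
        // Unmatched customers survive the left outer join with a null o_clerk,
        // plus any order rows whose o_clerk is actually null.
        return (customers - customersWithOrders) + ordersWithNullClerk;
    }

    public static void main(String[] args) {
        // 150K customers, only 100K with orders, o_clerk never null in TPC-H.
        System.out.println(nullClerkRows(150_000, 100_000, 0)); // 50000, not 0
    }
}
{code}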
So it may be that the earlier bug, which accidentally disabled {{IS NULL}}
handling, actually makes this particular use case work better...
> IS [NOT] NULL gives wrong selectivity when null count is missing
> ----------------------------------------------------------------
>
> Key: IMPALA-8050
> URL: https://issues.apache.org/jira/browse/IMPALA-8050
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)