Aman Sinha created IMPALA-9911:
----------------------------------
Summary: IS [NOT] NULL predicate selectivity estimate is wrong if #nulls is 0
Key: IMPALA-9911
URL: https://issues.apache.org/jira/browse/IMPALA-9911
Project: IMPALA
Issue Type: Bug
Components: Frontend
Affects Versions: Impala 3.4.0
Reporter: Aman Sinha
Assignee: Aman Sinha
Consider the TPC-DS customer table: its c_current_addr_sk column has #Nulls = 0,
as shown below.
{noformat}
tpcds> show column stats customer;
+--------------------+------+------------------+--------+----------+----------+
| Column             | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+------+------------------+--------+----------+----------+
....
| c_current_cdemo_sk | INT  | 91558            | 3438   | 4        | 4        |
| c_current_hdemo_sk | INT  | 7376             | 3431   | 4        | 4        |
| c_current_addr_sk  | INT  | 42003            | 0      | 4        | 4        |
....
{noformat}
The cardinality estimates for the following predicates show that a default
selectivity of 10% is being applied, which is not correct. With #Nulls = 0,
IS NOT NULL should select every row and IS NULL should select none:
{noformat}
explain select c_current_addr_sk from customer where c_current_addr_sk is not null;
| 00:SCAN HDFS [tpcds.customer] |
| HDFS partitions=1/1 files=1 size=12.60MB |
| predicates: c_current_addr_sk IS NOT NULL |
| row-size=4B cardinality=10.00K |
+------------------------------------------------------------+
explain select c_current_addr_sk from customer where c_current_addr_sk is null;
| 00:SCAN HDFS [tpcds.customer] |
| HDFS partitions=1/1 files=1 size=12.60MB |
| predicates: c_current_addr_sk IS NULL |
| row-size=4B cardinality=10.00K |
{noformat}
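The expected behavior can be illustrated with a minimal sketch. This is not Impala's planner code: the function name, the `DEFAULT_SELECTIVITY` constant, and the convention that a missing or negative count means "stats unknown" are all assumptions made for illustration. The row count of 100,000 is inferred from the plans above (10% of the table is reported as 10.00K rows) and is not stated in the report. The point is that a null count of 0 is a valid statistic and should be handled differently from a missing one.

```python
from typing import Optional

# Hypothetical fallback, chosen to match the 10% seen in the plans above.
DEFAULT_SELECTIVITY = 0.1


def null_predicate_selectivity(
    num_nulls: Optional[int], num_rows: Optional[int], is_not_null: bool
) -> float:
    """Estimate the selectivity of `col IS [NOT] NULL` from column stats."""
    # Assumed convention: None or a negative value means the stat is unknown.
    if num_nulls is None or num_rows is None or num_nulls < 0 or num_rows <= 0:
        return DEFAULT_SELECTIVITY
    # A count of 0 is a known value and flows through the normal path.
    null_fraction = min(num_nulls / num_rows, 1.0)
    return 1.0 - null_fraction if is_not_null else null_fraction


# c_current_addr_sk: #Nulls = 0, row count inferred as 100,000.
print(null_predicate_selectivity(0, 100_000, is_not_null=True))
print(null_predicate_selectivity(0, 100_000, is_not_null=False))
```

Under these assumptions, IS NOT NULL on c_current_addr_sk would be estimated at the full table cardinality and IS NULL at zero rows, while a column with no statistics would still fall back to the default.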