Stamatis Zampetakis created HIVE-29524:
------------------------------------------

             Summary: Missing num_nulls statistic for partition columns
                 Key: HIVE-29524
                 URL: https://issues.apache.org/jira/browse/HIVE-29524
             Project: Hive
          Issue Type: Bug
          Components: Query Planning
            Reporter: Stamatis Zampetakis


No statistics are kept about the number of null values in partition 
columns, so the {{num_nulls}} entry is always zero.

{code:sql}
CREATE TABLE emp (eid INT, ename STRING) PARTITIONED BY (bdate INT);

INSERT INTO emp
VALUES (1, 'Bob', 20200101),
       (2, 'Alice', 20200102),
       (3, 'Sam', 20200103),
       (4, 'John', null),
       (5, 'Jane', null),
       (6, 'Tom', null),
       (7, null, 20200103),
       (8, null, 20200103),
       (null, 'Tom', 20200109),
       (null, 'Jane', 20200110);

DESCRIBE FORMATTED emp bdate;
{code}

{noformat}
col_name                bdate               
data_type               int                 
min                     20200101            
max                     20200110            
num_nulls               0                   
distinct_count          6                   
avg_col_len                                 
max_col_len                                 
num_trues                                   
num_falses                                  
bit_vector                                  
comment                                     
COLUMN_STATS_ACCURATE   {\"COLUMN_STATS\":{\"bdate\":\"true\"}}
{noformat}
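
For comparison, the actual number of nulls can be obtained by scanning the 
table. A minimal check, assuming the INSERT above completed as written:

{code:sql}
-- Counts rows whose partition column is null (eid 4, 5 and 6 in the data above).
SELECT COUNT(*) FROM emp WHERE bdate IS NULL;
{code}

Three of the ten inserted rows have a null {{bdate}}, so this query is 
expected to return 3, while {{num_nulls}} reports 0.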

Any estimates that rely on 
[ColStatistics.getNumNulls|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/plan/ColStatistics.java#L72]
 are therefore likely to be inaccurate.

Queries/plans with IS [NOT] NULL predicates on partition columns are directly 
affected, since selectivity and row counts depend on this information. The 
impact on CBO planning is more limited because predicates on partition 
columns currently do not contribute to the [selectivity of the 
predicate|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L106].
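
One possible way to observe the effect is to compare the row estimates in the 
plans of the two queries below against the real counts. This is only a 
sketch: it assumes the {{emp}} table from the example exists, and the exact 
estimates depend on the Hive version and configuration, so no expected plan 
output is given.

{code:sql}
-- Row estimates for both filters depend on num_nulls of the partition column.
EXPLAIN SELECT * FROM emp WHERE bdate IS NULL;
EXPLAIN SELECT * FROM emp WHERE bdate IS NOT NULL;
{code}

In the sample data 3 rows have a null {{bdate}} and 7 rows do not, which 
gives the reference values for the comparison.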






--
This message was sent by Atlassian Jira
(v8.20.10#820010)
