[jira] [Commented] (HIVE-29524) Missing num_nulls statistic for partition columns

Stamatis Zampetakis (Jira) Wed, 25 Mar 2026 06:04:15 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-29524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068296#comment-18068296
 ]


Stamatis Zampetakis commented on HIVE-29524:
--------------------------------------------

For more details on how column stats are generated for partition columns, see 
[StatsUtils.getColStatsForPartCol|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L594].

Note that Iceberg partitioned tables are currently not affected, since storage 
and retrieval of statistics work differently for them.

> Missing num_nulls statistic for partition columns
> -------------------------------------------------
>
>                 Key: HIVE-29524
>                 URL: https://issues.apache.org/jira/browse/HIVE-29524
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>            Reporter: Stamatis Zampetakis
>            Priority: Major
>         Attachments: columnstats_part_num_nulls.q
>
>
> No statistics are available about the number of null values in partition 
> columns, so the {{num_nulls}} entry is always zero.
> {code:sql}
> CREATE TABLE emp (eid INT, ename STRING) partitioned by (bdate INT);
> INSERT INTO emp
> VALUES (1, 'Bob', 20200101),
>        (2, 'Alice', 20200102),
>        (3, 'Sam', 20200103),
>        (4, 'John', null),
>        (5, 'Jane', null),
>        (6, 'Tom', null),
>        (7, null, 20200103),
>        (8, null, 20200103),
>        (null, 'Tom', 20200109),
>        (null, 'Jane', 20200110);
> DESCRIBE FORMATTED emp bdate;
> {code}
> {noformat}
> col_name              bdate               
> data_type             int                 
> min                   20200101            
> max                   20200110            
> num_nulls             0                   
> distinct_count        6                   
> avg_col_len                               
> max_col_len                               
> num_trues                                 
> num_falses                                
> bit_vector                                
> comment                                   
> COLUMN_STATS_ACCURATE {\"COLUMN_STATS\":{\"bdate\":\"true\"}}
> {noformat}
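> The reported value can be cross-checked against the data itself. A minimal 
> sketch, assuming the {{emp}} table above and that rows inserted with a null 
> {{bdate}} are read back as NULL (the {{actual_num_nulls}} alias is only 
> illustrative):
> {code:sql}
> -- COUNT(*) counts all rows, COUNT(bdate) skips rows where bdate is NULL,
> -- so the difference is the number of null partition values.
> SELECT COUNT(*) - COUNT(bdate) AS actual_num_nulls
> FROM emp;
> {code}
> Three of the ten inserted rows have a null {{bdate}}, which is the value one 
> would expect {{num_nulls}} to report instead of 0.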
> Any estimates that rely on 
> [ColStatistics.getNumNulls|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/plan/ColStatistics.java#L72]
>  are likely to be inaccurate.
> Queries/plans with IS [NOT] NULL predicates on partition columns are directly 
> affected, since selectivity and row counts depend on this information. The 
> impact on CBO planning is more limited, because predicates on partition 
> columns currently do not contribute to the [selectivity of the 
> predicate|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L106].
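> One way to observe the effect is to inspect the plan of a query that filters 
> on null partition values. A minimal sketch, assuming the {{emp}} table above; 
> the plan output depends on the Hive version and configuration, so none is 
> shown here:
> {code:sql}
> -- Per the description above, the row estimate for this filter depends on
> -- the num_nulls statistic of the partition column.
> EXPLAIN
> SELECT eid, ename
> FROM emp
> WHERE bdate IS NULL;
> {code}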



