[
https://issues.apache.org/jira/browse/HIVE-29524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068296#comment-18068296
]
Stamatis Zampetakis commented on HIVE-29524:
--------------------------------------------
For more details on how the column stats are generated for partition columns,
see
[StatsUtils.getColStatsForPartCol|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L594].
Note that Iceberg partitioned tables are currently not affected, since storage
and retrieval of statistics work a bit differently there.
> Missing num_nulls statistic for partition columns
> -------------------------------------------------
>
> Key: HIVE-29524
> URL: https://issues.apache.org/jira/browse/HIVE-29524
> Project: Hive
> Issue Type: Bug
> Components: Query Planning
> Reporter: Stamatis Zampetakis
> Priority: Major
> Attachments: columnstats_part_num_nulls.q
>
>
> There is no information/statistics about the number of null values in
> partition columns, so the {{num_nulls}} entry is always zero.
> {code:sql}
> CREATE TABLE emp (eid INT, ename STRING) PARTITIONED BY (bdate INT);
> INSERT INTO emp
> VALUES (1, 'Bob', 20200101),
>        (2, 'Alice', 20200102),
>        (3, 'Sam', 20200103),
>        (4, 'John', NULL),
>        (5, 'Jane', NULL),
>        (6, 'Tom', NULL),
>        (7, NULL, 20200103),
>        (8, NULL, 20200103),
>        (NULL, 'Tom', 20200109),
>        (NULL, 'Jane', 20200110);
> DESCRIBE FORMATTED emp bdate;
> {code}
> {noformat}
> col_name bdate
> data_type int
> min 20200101
> max 20200110
> num_nulls 0
> distinct_count 6
> avg_col_len
> max_col_len
> num_trues
> num_falses
> bit_vector
> comment
> COLUMN_STATS_ACCURATE {\"COLUMN_STATS\":{\"bdate\":\"true\"}}
> {noformat}
> Any estimations that rely on
> [ColStatistics.getNumNulls|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/plan/ColStatistics.java#L72]
> are likely to be inaccurate.
> Queries/plans with IS [NOT] NULL predicates on partition columns are directly
> affected since selectivity and row counts depend on this information. The
> impact on CBO planning is more limited because predicates on partition
> columns currently do not contribute to the [selectivity of the
> predicate|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L106].
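
To make the effect on IS [NOT] NULL estimates concrete, here is a minimal sketch. It assumes the common textbook estimate in which the selectivity of an IS NULL predicate is num_nulls divided by the row count; this is an illustration only and not a copy of Hive's estimator code. The class and method names are hypothetical. The numbers come from the emp example in the description (10 rows, 3 of them with a null bdate).

```java
// Hypothetical sketch: NOT Hive's actual estimator, only the basic idea.
public final class NullSelectivitySketch {

    /** Assumed estimate: fraction of rows matching "col IS NULL". */
    static double isNullSelectivity(long numNulls, long numRows) {
        if (numRows <= 0) {
            return 0.0; // no rows, nothing can match
        }
        return Math.min(1.0, (double) numNulls / numRows);
    }

    /** Assumed estimate: fraction of rows matching "col IS NOT NULL". */
    static double isNotNullSelectivity(long numNulls, long numRows) {
        return 1.0 - isNullSelectivity(numNulls, numRows);
    }

    /** Converts a selectivity into an estimated row count. */
    static long estimatedRows(double selectivity, long numRows) {
        return Math.round(selectivity * numRows);
    }

    public static void main(String[] args) {
        long numRows = 10;      // rows inserted into emp
        long reportedNulls = 0; // num_nulls shown by DESCRIBE FORMATTED emp bdate
        long actualNulls = 3;   // rows inserted with a null bdate

        // Estimate for "bdate IS NULL" using the reported statistic.
        System.out.println("IS NULL, reported stats: "
            + estimatedRows(isNullSelectivity(reportedNulls, numRows), numRows));
        // Estimate for the same predicate if num_nulls were tracked.
        System.out.println("IS NULL, actual nulls:   "
            + estimatedRows(isNullSelectivity(actualNulls, numRows), numRows));
        // Estimate for "bdate IS NOT NULL" using the reported statistic.
        System.out.println("IS NOT NULL, reported:   "
            + estimatedRows(isNotNullSelectivity(reportedNulls, numRows), numRows));
    }
}
```

Under this assumed formula, a num_nulls of zero makes the IS NULL predicate look like it matches no rows at all, while IS NOT NULL looks like it keeps every row, so both estimates are off by the three rows that have a null bdate.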