Stamatis Zampetakis created HIVE-29524:
------------------------------------------
Summary: Missing num_nulls statistic for partition columns
Key: HIVE-29524
URL: https://issues.apache.org/jira/browse/HIVE-29524
Project: Hive
Issue Type: Bug
Components: Query Planning
Reporter: Stamatis Zampetakis
No statistics are collected about the number of null values in partition
columns, so the {{num_nulls}} entry is always zero.
{code:sql}
CREATE TABLE emp (eid INT, ename STRING) PARTITIONED BY (bdate INT);
INSERT INTO emp
VALUES (1, 'Bob', 20200101),
(2, 'Alice', 20200102),
(3, 'Sam', 20200103),
(4, 'John', null),
(5, 'Jane', null),
(6, 'Tom', null),
(7, null, 20200103),
(8, null, 20200103),
(null, 'Tom', 20200109),
(null, 'Jane', 20200110);
DESCRIBE FORMATTED emp bdate;
{code}
{noformat}
col_name bdate
data_type int
min 20200101
max 20200110
num_nulls 0
distinct_count 6
avg_col_len
max_col_len
num_trues
num_falses
bit_vector
comment
COLUMN_STATS_ACCURATE {\"COLUMN_STATS\":{\"bdate\":\"true\"}}
{noformat}
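As a cross-check on the reproduction above, the real number of nulls can be counted directly. This is only a sketch for comparison: three of the ten inserted rows have a null {{bdate}}, so the count is expected to differ from the reported {{num_nulls}} of 0. The second statement is an assumption about a useful contrast, showing the statistics of a regular (non-partition) column of the same table.
{code:sql}
-- Count the rows whose partition column is actually null
-- (rows 4, 5 and 6 in the INSERT above).
SELECT COUNT(*) FROM emp WHERE bdate IS NULL;

-- For contrast, inspect the statistics of a non-partition column.
DESCRIBE FORMATTED emp ename;
{code}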
Any estimates that rely on
[ColStatistics.getNumNulls|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/plan/ColStatistics.java#L72]
are likely to be inaccurate.
Queries/plans with IS [NOT] NULL predicates on partition columns are directly
affected, since selectivity and row-count estimates depend on this information.
The impact on CBO planning is more limited, because no predicate on a partition
column currently contributes to the [selectivity of the
predicate|https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L106].
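One possible way to observe the effect, assuming the {{emp}} table from the reproduction above, is to compare the estimated row counts in the plans of queries that filter on the partition column with IS [NOT] NULL. Toggling {{hive.cbo.enable}} is only a suggestion for comparing the CBO and non-CBO paths; the exact numbers will depend on the Hive version and configuration.
{code:sql}
-- Estimated row counts appear in the "Statistics:" entries of the plan.
EXPLAIN SELECT eid, ename FROM emp WHERE bdate IS NULL;
EXPLAIN SELECT eid, ename FROM emp WHERE bdate IS NOT NULL;

-- Repeat with CBO disabled to compare against the non-CBO estimates.
SET hive.cbo.enable=false;
EXPLAIN SELECT eid, ename FROM emp WHERE bdate IS NULL;
EXPLAIN SELECT eid, ename FROM emp WHERE bdate IS NOT NULL;
{code}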