Stamatis Zampetakis created HIVE-29541:
------------------------------------------
Summary: Imprecise NDV stats on Iceberg partition columns
Key: HIVE-29541
URL: https://issues.apache.org/jira/browse/HIVE-29541
Project: Hive
Issue Type: Bug
Components: Statistics
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
The number of distinct values (NDV/countDistinct) statistic is slightly off for
Iceberg partition columns. Currently, the NDV stats for Iceberg columns (both
partition and regular) are computed by aggregating the individual stats from
each partition. The aggregation is subject to a small margin of error: as long
as we rely on probabilistic data structures (e.g., HyperLogLog), a fully
accurate result cannot be guaranteed.
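To illustrate where the margin of error comes from, here is a minimal sketch of a HyperLogLog-style estimator whose per-partition sketches are merged by taking the register-wise maximum. It is a toy written for this description, not Hive's implementation: the class name, the register count (2**10), the SHA-1 based hash, and the data (1500 partitions with one key each) are all assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative only, not Hive's code)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Deterministic 64-bit hash of the value's string form.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging two sketches keeps the register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical data: 1500 partitions, each holding exactly one distinct key.
partition_sketches = []
for key in range(1, 1501):
    sketch = HyperLogLog()
    sketch.add(key)
    partition_sketches.append(sketch)

# Aggregate the per-partition sketches into a table-level sketch.
merged = HyperLogLog()
for sketch in partition_sketches:
    merged.merge(sketch)

ndv_estimate = merged.estimate()
print(round(ndv_estimate), "estimated vs", len(partition_sketches), "actual")
```

The merged sketch is identical to one built over all keys at once, so merging itself loses nothing. The error comes from the estimator, which only approximates the true count.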
However, for partition columns we know exactly how many partitions are present
in the table, so we don't need to rely on probabilistic data structures: the
NDV is equal to the number of partitions (no complex aggregation needed).
The StatsUtils class already contains some logic (getColStatsForPartCol) to
compute the NDV, along with some other stats, directly from the partitions, but
this logic is not applied to Iceberg tables.
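The exact computation can be sketched as follows. This is a hypothetical illustration, not the code in getColStatsForPartCol: the function name and the partition representation (one dict per partition, mapping column name to partition value) are assumptions, and it presumes the column's values are used as partition values unchanged.

```python
def partition_column_ndv(partitions, column):
    """Exact NDV of a partition column: the number of distinct values
    the column takes across all partitions of the table."""
    return len({spec[column] for spec in partitions})


# Hypothetical partition list, one dict per partition.
partitions = [{"l_orderkey": key} for key in (1, 2, 3, 4, 5, 6, 7, 32)]
print(partition_column_ndv(partitions, "l_orderkey"))  # prints 8
```

With a single partition column, as in this sketch, the result is simply the number of partitions.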
The problem can also be seen in qtests after loading the LINEITEM table from
the TPC_0_001 database into an Iceberg table using L_ORDERKEY as the partition
key and running DESCRIBE FORMATTED on the partitioning column.
{code:sql}
DESC FORMATTED ice.lineitem l_orderkey
{code}
{noformat}
col_name L_ORDERKEY
data_type int
min 1
max 5988
num_nulls 0
distinct_count 1523
avg_col_len
max_col_len
num_trues
num_falses
bit_vector HL
comment
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"l_comment\":\"true\",\"l_commitdate\":\"true\",\"l_discount\":\"true\",\"l_extendedprice\":\"true\",\"l_linenumber\":\"true\",\"l_linestatus\":\"true\",\"l_orderkey\":\"true\",\"l_partkey\":\"true\",\"l_quantity\":\"true\",\"l_receiptdate\":\"true\",\"l_returnflag\":\"true\",\"l_shipdate\":\"true\",\"l_shipinstruct\":\"true\",\"l_shipmode\":\"true\",\"l_suppkey\":\"true\",\"l_tax\":\"true\"}}
{noformat}
Observe that distinct_count (NDV) is reported as 1523 while the actual number
of distinct values is 1500, an overestimate of roughly 1.5%. In non-Iceberg
tables the NDV of partition columns is accurate.