Stamatis Zampetakis created HIVE-29541:
------------------------------------------

             Summary: Imprecise NDV stats on Iceberg partition columns
                 Key: HIVE-29541
                 URL: https://issues.apache.org/jira/browse/HIVE-29541
             Project: Hive
          Issue Type: Bug
          Components: Statistics
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


The number of distinct values (NDV/countDistinct) statistic is slightly off for 
Iceberg partition columns.  Currently, the NDV stats for Iceberg (partition and 
regular) columns is computed by aggregating the individual stats from each 
partition. The aggregation logic is subject to a small margin of error since 
there is no way to have a fully accurate result from the moment that we rely on 
probabilistic data structures (i.e., HyperLogLog).

However, for partition columns we know exactly how many partitions are present 
in the table so we don't need to rely on probabilistic data structures since 
the NDV is equal to the number of partitions (no complex aggregation needed). 
The StatsUtils class already contains some logic (getColStatsForPartCol) to 
compute the NDV along with some other stats directly from partitions but this 
does not kick in for Iceberg tables.

The problem can be seen also in qtests after loading the LINEITEM table from 
TPC_0_001 database in an Iceberg table using the L_ORDERKEY as a partition key 
and running DESCRIBE FORMATTED on the partitioning column.

{code:sql}
DESC FORMATTED ice.lineitem l_orderkey
{code}

{noformat}
col_name                L_ORDERKEY          
data_type               int                 
min                     1                   
max                     5988                
num_nulls               0                   
distinct_count          1523                
avg_col_len                                 
max_col_len                                 
num_trues                                   
num_falses                                  
bit_vector              HL                  
comment                                     
COLUMN_STATS_ACCURATE   
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"l_comment\":\"true\",\"l_commitdate\":\"true\",\"l_discount\":\"true\",\"l_extendedprice\":\"true\",\"l_linenumber\":\"true\",\"l_linestatus\":\"true\",\"l_orderkey\":\"true\",\"l_partkey\":\"true\",\"l_quantity\":\"true\",\"l_receiptdate\":\"true\",\"l_returnflag\":\"true\",\"l_shipdate\":\"true\",\"l_shipinstruct\":\"true\",\"l_shipmode\":\"true\",\"l_suppkey\":\"true\",\"l_tax\":\"true\"}}
{noformat}
Observe that distinct_count (NDV) is 1523 while the real number is 1500. In 
non-Iceberg tables the NDV is accurate.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to