[jira] [Commented] (HIVE-29371) distinct_count estimates for partition columns of Iceberg tables are much less accurate than for Hive tables.

Sungwoo Park (Jira) Mon, 15 Dec 2025 08:27:27 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-29371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045211#comment-18045211
 ]


Sungwoo Park commented on HIVE-29371:
-------------------------------------

Here are findings from a few more experiments.
 # If an Iceberg table is created with 'PARTITIONED BY', column levels 
statistics such as min, max, and distinct_count are highly inaccurate. For 
example, the true value of distinct_count is 2104, Hive may report 2.
 # The statistics also vary randomly. The same experiment (create an Iceberg 
table, load it with data, and run 'ANALYZE') may produce different results.

If 'PARTITIONED BY' is not used when creating Iceberg tables, column level 
statistics are highly accurate (similarly to Hive-native tables).

Probably because of bad column statistics, running 10TB TPC-DS queries takes 
much longer on Iceberg tables than on Hive-native tables.

> distinct_count estimates for partition columns of Iceberg tables are much 
> less accurate than for Hive tables.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29371
>                 URL: https://issues.apache.org/jira/browse/HIVE-29371
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Sungwoo Park
>            Priority: Major
>
> Executing 'analyze table compute statistics for columns' estimates 
> distinct_count values as column-level statistics of tables.
> On a partitioned Iceberg table, distinct_count values for partition columns  
> are very different from their values in its equivalent Hive-native table.
> For complex queries with many joins, the high inaccuracy can lead to very 
> inefficent query plans. For example, on the 10TB TPC-DS benchmark, the total 
> running time increases a lot. Example: for query 1 to query 22, 900 seconds 
> --> 1400 seconds. 
> Here is an example of creating a catalog_returns table in the 10TB TPC-DS 
> benchmark.
> 1. Create catalog_returns table from 10TB TPC-DS data.
> set hive.iceberg.stats.source=iceberg;
> drop table if exists catalog_returns purge;
> create external table catalog_returns
> ( cr_returned_time_sk bigint, cr_item_sk bigint, cr_refunded_customer_sk 
> bigint, cr_refunded_cdemo_sk bigint, cr_refunded_hdemo_sk bigint, 
> cr_refunded_addr_sk bigint, cr_returning_customer_sk bigint, 
> cr_returning_cdemo_sk bigint, cr_returning_hdemo_sk bigint, 
> cr_returning_addr_sk bigint, cr_call_center_sk bigint, cr_catalog_page_sk 
> bigint, cr_ship_mode_sk bigint, cr_warehouse_sk bigint, cr_reason_sk bigint, 
> cr_order_number bigint, cr_return_quantity int, cr_return_amount double, 
> cr_return_tax double, cr_return_amt_inc_tax double, cr_fee double, 
> cr_return_ship_cost double, cr_refunded_cash double, cr_reversed_charge 
> double, cr_store_credit double, cr_net_loss double)
> partitioned by (cr_returned_date_sk bigint)
> STORED BY ICEBERG
> stored as orc
> tblproperties ("orc.compress"="SNAPPY");
> insert overwrite table catalog_returns select * from 
> tpcds_bin_partitioned_orc_10000.catalog_returns;
> 2. The correct distinct_count values are computed as follows:
> --- partition column cr_returned_date_sk
> select count(distinct cr_returned_date_sk) from catalog_returns;
> | 2104  |
> --- a sample non-partition column cr_returned_time_sk
> select count(distinct cr_returned_time_sk) from catalog_returns;
> | 86400  |
> --- before executing 'analyze table'
> describe formatted catalog_returns cr_returned_date_sk;
> | distinct_count         | 2
> 3. Now, execute 'analyze table'.
> analyze table catalog_returns compute statistics for columns;
> --- On a Hive table (not an Iceberg table) with the same data, it reports 
> 2104, so distinct_count=248 is highly inaccurate.
> describe formatted catalog_returns cr_returned_date_sk;
> | distinct_count         | 248                   |
> --- For a sample non-partition column cr_returned_time_sk, distinct_count is 
> exactly the same as obtained from a Hive-native table with the same data.
> describe formatted catalog_returns cr_returned_time_sk;
> | distinct_count         | 85503                 |



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-29371) distinct_count estimates for partition columns of Iceberg tables are much less accurate than for Hive tables.

Reply via email to