[ 
https://issues.apache.org/jira/browse/HIVE-29371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18046814#comment-18046814
 ] 

Sungwoo Park commented on HIVE-29371:
-------------------------------------

When ColStatsProcessor.persistColumnStats() is called, 
constructColumnStatsFromPackedRows() procesess column stat objects in a batch 
of HIVE_STATS_MAX_NUM_STATS objects.
For Iceberg tables, only the last becomes effective, while all the previous 
batches are ignored.

A workaround is to override the default value of hive.stats.max.num.stats 
(10000) by setting it to a value larger than the number of partitions.

 

> distinct_count estimates for partition columns of Iceberg tables are much 
> less accurate than for Hive tables.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29371
>                 URL: https://issues.apache.org/jira/browse/HIVE-29371
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Sungwoo Park
>            Priority: Major
>
> Executing 'analyze table compute statistics for columns' estimates 
> distinct_count values as column-level statistics of tables.
> On a partitioned Iceberg table, distinct_count values for partition columns  
> are very different from their values in its equivalent Hive-native table.
> For complex queries with many joins, the high inaccuracy can lead to very 
> inefficent query plans. For example, on the 10TB TPC-DS benchmark, the total 
> running time increases a lot. Example: for query 1 to query 22, 900 seconds 
> --> 1400 seconds. 
> Here is an example of creating a catalog_returns table in the 10TB TPC-DS 
> benchmark.
> 1. Create catalog_returns table from 10TB TPC-DS data.
> set hive.iceberg.stats.source=iceberg;
> drop table if exists catalog_returns purge;
> create external table catalog_returns
> ( cr_returned_time_sk bigint, cr_item_sk bigint, cr_refunded_customer_sk 
> bigint, cr_refunded_cdemo_sk bigint, cr_refunded_hdemo_sk bigint, 
> cr_refunded_addr_sk bigint, cr_returning_customer_sk bigint, 
> cr_returning_cdemo_sk bigint, cr_returning_hdemo_sk bigint, 
> cr_returning_addr_sk bigint, cr_call_center_sk bigint, cr_catalog_page_sk 
> bigint, cr_ship_mode_sk bigint, cr_warehouse_sk bigint, cr_reason_sk bigint, 
> cr_order_number bigint, cr_return_quantity int, cr_return_amount double, 
> cr_return_tax double, cr_return_amt_inc_tax double, cr_fee double, 
> cr_return_ship_cost double, cr_refunded_cash double, cr_reversed_charge 
> double, cr_store_credit double, cr_net_loss double)
> partitioned by (cr_returned_date_sk bigint)
> STORED BY ICEBERG
> stored as orc
> tblproperties ("orc.compress"="SNAPPY");
> insert overwrite table catalog_returns select * from 
> tpcds_bin_partitioned_orc_10000.catalog_returns;
> 2. The correct distinct_count values are computed as follows:
> --- partition column cr_returned_date_sk
> select count(distinct cr_returned_date_sk) from catalog_returns;
> | 2104  |
> --- a sample non-partition column cr_returned_time_sk
> select count(distinct cr_returned_time_sk) from catalog_returns;
> | 86400  |
> --- before executing 'analyze table'
> describe formatted catalog_returns cr_returned_date_sk;
> | distinct_count         | 2
> 3. Now, execute 'analyze table'.
> analyze table catalog_returns compute statistics for columns;
> --- On a Hive table (not an Iceberg table) with the same data, it reports 
> 2104, so distinct_count=248 is highly inaccurate.
> describe formatted catalog_returns cr_returned_date_sk;
> | distinct_count         | 248                   |
> --- For a sample non-partition column cr_returned_time_sk, distinct_count is 
> exactly the same as obtained from a Hive-native table with the same data.
> describe formatted catalog_returns cr_returned_time_sk;
> | distinct_count         | 85503                 |



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to