[
https://issues.apache.org/jira/browse/HIVE-29371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18046814#comment-18046814
]
Sungwoo Park commented on HIVE-29371:
-------------------------------------
When ColStatsProcessor.persistColumnStats() is called,
constructColumnStatsFromPackedRows() procesess column stat objects in a batch
of HIVE_STATS_MAX_NUM_STATS objects.
For Iceberg tables, only the last becomes effective, while all the previous
batches are ignored.
A workaround is to override the default value of hive.stats.max.num.stats
(10000) by setting it to a value larger than the number of partitions.
> distinct_count estimates for partition columns of Iceberg tables are much
> less accurate than for Hive tables.
> -------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29371
> URL: https://issues.apache.org/jira/browse/HIVE-29371
> Project: Hive
> Issue Type: Bug
> Affects Versions: 4.2.0
> Reporter: Sungwoo Park
> Priority: Major
>
> Executing 'analyze table compute statistics for columns' estimates
> distinct_count values as column-level statistics of tables.
> On a partitioned Iceberg table, distinct_count values for partition columns
> are very different from their values in its equivalent Hive-native table.
> For complex queries with many joins, the high inaccuracy can lead to very
> inefficent query plans. For example, on the 10TB TPC-DS benchmark, the total
> running time increases a lot. Example: for query 1 to query 22, 900 seconds
> --> 1400 seconds.
> Here is an example of creating a catalog_returns table in the 10TB TPC-DS
> benchmark.
> 1. Create catalog_returns table from 10TB TPC-DS data.
> set hive.iceberg.stats.source=iceberg;
> drop table if exists catalog_returns purge;
> create external table catalog_returns
> ( cr_returned_time_sk bigint, cr_item_sk bigint, cr_refunded_customer_sk
> bigint, cr_refunded_cdemo_sk bigint, cr_refunded_hdemo_sk bigint,
> cr_refunded_addr_sk bigint, cr_returning_customer_sk bigint,
> cr_returning_cdemo_sk bigint, cr_returning_hdemo_sk bigint,
> cr_returning_addr_sk bigint, cr_call_center_sk bigint, cr_catalog_page_sk
> bigint, cr_ship_mode_sk bigint, cr_warehouse_sk bigint, cr_reason_sk bigint,
> cr_order_number bigint, cr_return_quantity int, cr_return_amount double,
> cr_return_tax double, cr_return_amt_inc_tax double, cr_fee double,
> cr_return_ship_cost double, cr_refunded_cash double, cr_reversed_charge
> double, cr_store_credit double, cr_net_loss double)
> partitioned by (cr_returned_date_sk bigint)
> STORED BY ICEBERG
> stored as orc
> tblproperties ("orc.compress"="SNAPPY");
> insert overwrite table catalog_returns select * from
> tpcds_bin_partitioned_orc_10000.catalog_returns;
> 2. The correct distinct_count values are computed as follows:
> --- partition column cr_returned_date_sk
> select count(distinct cr_returned_date_sk) from catalog_returns;
> | 2104 |
> --- a sample non-partition column cr_returned_time_sk
> select count(distinct cr_returned_time_sk) from catalog_returns;
> | 86400 |
> --- before executing 'analyze table'
> describe formatted catalog_returns cr_returned_date_sk;
> | distinct_count | 2
> 3. Now, execute 'analyze table'.
> analyze table catalog_returns compute statistics for columns;
> --- On a Hive table (not an Iceberg table) with the same data, it reports
> 2104, so distinct_count=248 is highly inaccurate.
> describe formatted catalog_returns cr_returned_date_sk;
> | distinct_count | 248 |
> --- For a sample non-partition column cr_returned_time_sk, distinct_count is
> exactly the same as obtained from a Hive-native table with the same data.
> describe formatted catalog_returns cr_returned_time_sk;
> | distinct_count | 85503 |
--
This message was sent by Atlassian Jira
(v8.20.10#820010)