Sungwoo Park created HIVE-29371:
-----------------------------------
Summary: distinct_count estimates for partition columns of Iceberg
tables are much less accurate than for Hive tables.
Key: HIVE-29371
URL: https://issues.apache.org/jira/browse/HIVE-29371
Project: Hive
Issue Type: Bug
Affects Versions: 4.2.0
Reporter: Sungwoo Park
Executing 'analyze table compute statistics for columns' estimates
distinct_count values as column-level statistics of tables.
On a partitioned Iceberg table, distinct_count values for partition columns
are very different from their values in its equivalent Hive-native table.
For complex queries with many joins, the high inaccuracy can lead to very
inefficent query plans. For example, on the 10TB TPC-DS benchmark, the total
running time increases a lot. Example: for query 1 to query 22, 900 seconds -->
1400 seconds.
Here is an example of creating a catalog_returns table in the 10TB TPC-DS
benchmark.
1. Create catalog_returns table from 10TB TPC-DS data.
set hive.iceberg.stats.source=iceberg;
drop table if exists catalog_returns purge;
create external table catalog_returns
( cr_returned_time_sk bigint, cr_item_sk bigint, cr_refunded_customer_sk
bigint, cr_refunded_cdemo_sk bigint, cr_refunded_hdemo_sk bigint,
cr_refunded_addr_sk bigint, cr_returning_customer_sk bigint,
cr_returning_cdemo_sk bigint, cr_returning_hdemo_sk bigint,
cr_returning_addr_sk bigint, cr_call_center_sk bigint, cr_catalog_page_sk
bigint, cr_ship_mode_sk bigint, cr_warehouse_sk bigint, cr_reason_sk bigint,
cr_order_number bigint, cr_return_quantity int, cr_return_amount double,
cr_return_tax double, cr_return_amt_inc_tax double, cr_fee double,
cr_return_ship_cost double, cr_refunded_cash double, cr_reversed_charge double,
cr_store_credit double, cr_net_loss double)
partitioned by (cr_returned_date_sk bigint)
STORED BY ICEBERG
stored as orc
tblproperties ("orc.compress"="SNAPPY");
insert overwrite table catalog_returns select * from
tpcds_bin_partitioned_orc_10000.catalog_returns;
2. The correct distinct_count values are computed as follows:
--- partition column cr_returned_date_sk
select count(distinct cr_returned_date_sk) from catalog_returns;
| 2104 |
--- a sample non-partition column cr_returned_time_sk
select count(distinct cr_returned_time_sk) from catalog_returns;
| 86400 |
--- before executing 'analyze table'
describe formatted catalog_returns cr_returned_date_sk;
| distinct_count | 2
3. Now, execute 'analyze table'.
analyze table catalog_returns compute statistics for columns;
--- On a Hive table (not an Iceberg table) with the same data, it reports 2104,
so distinct_count=248 is highly inaccurate.
describe formatted catalog_returns cr_returned_date_sk;
| distinct_count | 248 |
--- For a sample non-partition column cr_returned_time_sk, distinct_count is
exactly the same as obtained from a Hive-native table with the same data.
describe formatted catalog_returns cr_returned_time_sk;
| distinct_count | 85503 |
--
This message was sent by Atlassian Jira
(v8.20.10#820010)