Sungwoo Park created HIVE-29371:
-----------------------------------

             Summary: distinct_count estimates for partition columns of Iceberg 
tables are much less accurate than for Hive tables.
                 Key: HIVE-29371
                 URL: https://issues.apache.org/jira/browse/HIVE-29371
             Project: Hive
          Issue Type: Bug
    Affects Versions: 4.2.0
            Reporter: Sungwoo Park


Executing 'analyze table compute statistics for columns' estimates 
distinct_count values as column-level statistics of tables.

On a partitioned Iceberg table, distinct_count values for partition columns  
are very different from their values in its equivalent Hive-native table.

For complex queries with many joins, the high inaccuracy can lead to very 
inefficent query plans. For example, on the 10TB TPC-DS benchmark, the total 
running time increases a lot. Example: for query 1 to query 22, 900 seconds --> 
1400 seconds. 

Here is an example of creating a catalog_returns table in the 10TB TPC-DS 
benchmark.

1. Create catalog_returns table from 10TB TPC-DS data.

set hive.iceberg.stats.source=iceberg;

drop table if exists catalog_returns purge;
create external table catalog_returns
( cr_returned_time_sk bigint, cr_item_sk bigint, cr_refunded_customer_sk 
bigint, cr_refunded_cdemo_sk bigint, cr_refunded_hdemo_sk bigint, 
cr_refunded_addr_sk bigint, cr_returning_customer_sk bigint, 
cr_returning_cdemo_sk bigint, cr_returning_hdemo_sk bigint, 
cr_returning_addr_sk bigint, cr_call_center_sk bigint, cr_catalog_page_sk 
bigint, cr_ship_mode_sk bigint, cr_warehouse_sk bigint, cr_reason_sk bigint, 
cr_order_number bigint, cr_return_quantity int, cr_return_amount double, 
cr_return_tax double, cr_return_amt_inc_tax double, cr_fee double, 
cr_return_ship_cost double, cr_refunded_cash double, cr_reversed_charge double, 
cr_store_credit double, cr_net_loss double)
partitioned by (cr_returned_date_sk bigint)
STORED BY ICEBERG
stored as orc
tblproperties ("orc.compress"="SNAPPY");

insert overwrite table catalog_returns select * from 
tpcds_bin_partitioned_orc_10000.catalog_returns;

2. The correct distinct_count values are computed as follows:

--- partition column cr_returned_date_sk
select count(distinct cr_returned_date_sk) from catalog_returns;
| 2104  |

--- a sample non-partition column cr_returned_time_sk
select count(distinct cr_returned_time_sk) from catalog_returns;
| 86400  |

--- before executing 'analyze table'
describe formatted catalog_returns cr_returned_date_sk;
| distinct_count         | 2

3. Now, execute 'analyze table'.

analyze table catalog_returns compute statistics for columns;

--- On a Hive table (not an Iceberg table) with the same data, it reports 2104, 
so distinct_count=248 is highly inaccurate.
describe formatted catalog_returns cr_returned_date_sk;
| distinct_count         | 248                   |

--- For a sample non-partition column cr_returned_time_sk, distinct_count is 
exactly the same as obtained from a Hive-native table with the same data.
describe formatted catalog_returns cr_returned_time_sk;
| distinct_count         | 85503                 |



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to