[ 
https://issues.apache.org/jira/browse/HIVE-29371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045400#comment-18045400
 ] 

Sungwoo Park commented on HIVE-29371:
-------------------------------------

For Iceberg tables,
the partition column cr_returned_date_sk is treated like an ordinary column 
that happens to be in an Iceberg partition spec.
So, ANALYZE runs a full HLL-based NDV computation for it.

(In contrast, a Hive-native partitioned table has partition keys in the table 
metadata, so ANALYZE skips partition columns when building the column-stats 
work.)

In our case, grouping by cr_returned_date_sk means each partition sees only a 
single value for that column, so the per-partition NDV is 1.

Because each partition’s NDV for cr_returned_date_sk is 1, Hive has to 
aggregate those per-partition NDVs to a table-level NDV.

On the Iceberg path, this aggregation seems wrong for some reason, because the 
finall distinct_count reported is extremely small (2 or 4) instead of the 
correct value of 2104. For example, it may be that HLLs are not being used 
correctly on the Iceberg side.

In addition, I still don't understand why min/max values are wrong. They are 
supposed to be correctly computed. For example, we see (correct value of max = 
2452924).

| min                    | 2450821               |
| max                    | 2452925               |
| num_nulls              | 0                     |
| distinct_count         | 4                     |

So, omitting grouping by the partition column can be only a quick fix, not a 
complete principled solution.

If you are familar with the code responsible for aggregating column statistics, 
please let me know.

> distinct_count estimates for partition columns of Iceberg tables are much 
> less accurate than for Hive tables.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29371
>                 URL: https://issues.apache.org/jira/browse/HIVE-29371
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Sungwoo Park
>            Priority: Major
>
> Executing 'analyze table compute statistics for columns' estimates 
> distinct_count values as column-level statistics of tables.
> On a partitioned Iceberg table, distinct_count values for partition columns  
> are very different from their values in its equivalent Hive-native table.
> For complex queries with many joins, the high inaccuracy can lead to very 
> inefficent query plans. For example, on the 10TB TPC-DS benchmark, the total 
> running time increases a lot. Example: for query 1 to query 22, 900 seconds 
> --> 1400 seconds. 
> Here is an example of creating a catalog_returns table in the 10TB TPC-DS 
> benchmark.
> 1. Create catalog_returns table from 10TB TPC-DS data.
> set hive.iceberg.stats.source=iceberg;
> drop table if exists catalog_returns purge;
> create external table catalog_returns
> ( cr_returned_time_sk bigint, cr_item_sk bigint, cr_refunded_customer_sk 
> bigint, cr_refunded_cdemo_sk bigint, cr_refunded_hdemo_sk bigint, 
> cr_refunded_addr_sk bigint, cr_returning_customer_sk bigint, 
> cr_returning_cdemo_sk bigint, cr_returning_hdemo_sk bigint, 
> cr_returning_addr_sk bigint, cr_call_center_sk bigint, cr_catalog_page_sk 
> bigint, cr_ship_mode_sk bigint, cr_warehouse_sk bigint, cr_reason_sk bigint, 
> cr_order_number bigint, cr_return_quantity int, cr_return_amount double, 
> cr_return_tax double, cr_return_amt_inc_tax double, cr_fee double, 
> cr_return_ship_cost double, cr_refunded_cash double, cr_reversed_charge 
> double, cr_store_credit double, cr_net_loss double)
> partitioned by (cr_returned_date_sk bigint)
> STORED BY ICEBERG
> stored as orc
> tblproperties ("orc.compress"="SNAPPY");
> insert overwrite table catalog_returns select * from 
> tpcds_bin_partitioned_orc_10000.catalog_returns;
> 2. The correct distinct_count values are computed as follows:
> --- partition column cr_returned_date_sk
> select count(distinct cr_returned_date_sk) from catalog_returns;
> | 2104  |
> --- a sample non-partition column cr_returned_time_sk
> select count(distinct cr_returned_time_sk) from catalog_returns;
> | 86400  |
> --- before executing 'analyze table'
> describe formatted catalog_returns cr_returned_date_sk;
> | distinct_count         | 2
> 3. Now, execute 'analyze table'.
> analyze table catalog_returns compute statistics for columns;
> --- On a Hive table (not an Iceberg table) with the same data, it reports 
> 2104, so distinct_count=248 is highly inaccurate.
> describe formatted catalog_returns cr_returned_date_sk;
> | distinct_count         | 248                   |
> --- For a sample non-partition column cr_returned_time_sk, distinct_count is 
> exactly the same as obtained from a Hive-native table with the same data.
> describe formatted catalog_returns cr_returned_time_sk;
> | distinct_count         | 85503                 |



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to