[ 
https://issues.apache.org/jira/browse/HIVE-29646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089093#comment-18089093
 ] 

Denys Kuzmenko commented on HIVE-29646:
---------------------------------------

Ideally, statistics should be kept up to date, either through incremental 
maintenance or via scheduled recomputation. If we start from the assumption 
that statistics on external tables are inherently untrustworthy, then I think a 
large class of cost-based optimizations becomes questionable, not just runtime 
filtering.

In Iceberg, table-level basic statistics are derived directly from the current 
snapshot metadata and are therefore accurate by construction. The concern is 
more about partition-level and column statistics, which can become stale when 
data is modified.

I also think the configuration remains useful. Even if we decide to allow 
statistics-based optimizations such as runtime filtering on external tables, 
there is still a distinction between using statistics to improve a query plan 
and using statistics to answer a query. For example, optimizations like 
dynamic partition pruning (DPP) or runtime filtering can at worst lead to a 
suboptimal plan when statistics are stale, whereas using statistics to answer 
{{COUNT(*)}} directly can return incorrect results. The latter is a much 
stronger correctness concern.

The property can still be used to guard operations where stale metadata may 
affect query correctness, while still allowing statistics-based planning 
optimizations on external tables.
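
To make the distinction concrete, here is a minimal sketch in Python. It is 
not Hive code: {{StatsUse}}, {{TableInfo}}, {{may_use_stats}} and the 
{{guard_external}} flag are hypothetical names that stand in for the property 
discussed above.

```python
from dataclasses import dataclass
from enum import Enum


class StatsUse(Enum):
    # Statistics only steer the plan (DPP, runtime filtering, map-join).
    PLAN = "plan"
    # Statistics replace the scan, e.g. answering COUNT(*) from metadata.
    ANSWER = "answer"


@dataclass(frozen=True)
class TableInfo:
    is_external: bool


def may_use_stats(table: TableInfo, use: StatsUse, guard_external: bool) -> bool:
    """Hypothetical policy: decide whether statistics may be used.

    guard_external models the existing property (an assumption of this sketch):
    when True, external tables may not have queries answered from statistics.
    """
    if use is StatsUse.PLAN:
        # Stale statistics can only make the plan slower, never wrong.
        return True
    # Answering from statistics can return wrong rows if they are stale,
    # so the guard applies here for external tables.
    return not (table.is_external and guard_external)
```

With the guard on, an external table still gets planning optimizations, but 
{{COUNT(*)}} is computed from data rather than from statistics.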

WDYT?

> Enable semijoin reduction and map-join conversion on external tables with 
> accurate statistics
> ---------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29646
>                 URL: https://issues.apache.org/jira/browse/HIVE-29646
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Denys Kuzmenko
>            Priority: Major
>              Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
