[
https://issues.apache.org/jira/browse/HIVE-29646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089093#comment-18089093
]
Denys Kuzmenko commented on HIVE-29646:
---------------------------------------
Ideally, statistics should be kept up to date, either through incremental
maintenance or via scheduled recomputation. If we start from the assumption
that statistics on external tables are inherently untrustworthy, then I think a
large class of cost-based optimizations becomes questionable, not just runtime
filtering.
In Iceberg, table-level basic statistics are derived directly from the current
snapshot metadata and are therefore accurate by construction. The concern is
more around partition-level/column statistics, which can become stale when data
is modified.
I also think the configuration remains useful. Even if we decide to allow
statistics-based optimizations such as runtime filtering on external tables,
there is still a distinction between using statistics to improve a query plan
and using statistics to answer a query. For example, optimizations like
DPP/runtime filtering can at worst lead to a suboptimal plan when statistics
are stale, whereas using statistics to answer {{COUNT(*)}} directly can return
incorrect results. The latter is a much stronger correctness concern.
The property can still be used to guard operations where stale metadata may
affect query correctness, while allowing statistics-based planning
optimizations.
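To make the distinction concrete, here is a toy sketch in Python. It is not Hive code: {{Table}}, {{count_star}}, {{hash_join}} and the {{answer_from_stats}} flag are hypothetical names made up for illustration. It assumes a table whose recorded row count (10) has drifted from the real data (1000 rows).

```python
from dataclasses import dataclass


@dataclass
class Table:
    rows: list            # the data actually present in the table
    stats_row_count: int  # row count recorded in metadata, possibly stale


def count_star(table, answer_from_stats):
    """COUNT(*): answered either from metadata or by scanning the data."""
    if answer_from_stats:
        return table.stats_row_count  # wrong whenever the stats are stale
    return len(table.rows)


def hash_join(left, right):
    """Count equi-join matches; stats are used only to pick the build side."""
    if left.stats_row_count <= right.stats_row_count:
        build, probe, side = left, right, "left"
    else:
        build, probe, side = right, left, "right"
    index = {}
    for value in build.rows:
        index[value] = index.get(value, 0) + 1
    matches = sum(index.get(value, 0) for value in probe.rows)
    return matches, side


# Same 1000 rows, once with stale stats and once with accurate stats.
stale = Table(rows=list(range(1000)), stats_row_count=10)
fresh = Table(rows=list(range(1000)), stats_row_count=1000)
dims = Table(rows=list(range(50)), stats_row_count=50)

print(count_star(stale, answer_from_stats=True))   # metadata answer
print(count_star(stale, answer_from_stats=False))  # scan answer
print(hash_join(stale, dims))  # builds the hash table on the large side
print(hash_join(fresh, dims))  # builds on the small side, same match count
```

In this sketch the stale row count only changes which side the join builds its hash table on, so the plan gets worse but the match count stays the same. Answering {{COUNT(*)}} from the same stale value changes the result itself, which is the case a guard like {{answer_from_stats}} would need to cover.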
WDYT?
> Enable semijoin reduction and map-join conversion on external tables with
> accurate statistics
> ---------------------------------------------------------------------------------------------
>
> Key: HIVE-29646
> URL: https://issues.apache.org/jira/browse/HIVE-29646
> Project: Hive
> Issue Type: Improvement
> Reporter: Denys Kuzmenko
> Priority: Major
> Labels: pull-request-available
>