[ 
https://issues.apache.org/jira/browse/HIVE-29646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089401#comment-18089401
 ] 

Stamatis Zampetakis commented on HIVE-29646:
--------------------------------------------

{quote}Ideally, statistics should be kept up to date, either through 
incremental maintenance or via scheduled recomputation. If we start from the 
assumption that statistics on external tables are inherently untrustworthy, 
then I think a large class of cost-based optimizations becomes questionable, 
not just runtime filtering.

In Iceberg table-level basic statistics are derived directly from the current 
snapshot metadata and are therefore accurate by construction. The concern is 
more around partition-level/column statistics, which can become stale when data 
is modified. 
{quote}
I fully agree. I don't know why runtime filtering became a special case, but 
the overall reasoning makes sense to me.
{quote}The property can still be used to guard operations where stale metadata 
may affect query correctness, while allowing statistics-based planning 
optimizations.
{quote}
I don't understand how the property can act as a guard given how the code 
works right now. Are you proposing to repurpose the property for other needs, 
or are you referring to the actual state of the code in the repo/PR?

Currently, the property guards against SMB/bucket join and runtime filtering 
transformations on external tables. It has no effect on query answering based 
on stats. Stats-based optimizations on external tables are explicitly 
prohibited, and this is not configurable at the moment.
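
To make the current behavior described above concrete, here is a rough sketch 
of it as a decision function. It is only a model of this description, not the 
actual Hive code: the class, enum, and method names are hypothetical, the 
property is assumed to be a plain boolean, and treating non-external tables as 
always allowed is a simplification.

```java
// Hypothetical model of the behavior described in this comment.
// None of these names exist in Hive; they are for illustration only.
final class ExternalTableGuards {

    enum Optimization {
        SMB_OR_BUCKET_JOIN,
        RUNTIME_FILTERING,
        STATS_BASED_QUERY_ANSWERING
    }

    /**
     * @param opt           the optimization being considered
     * @param guardEnabled  value of the existing property (assumed boolean)
     * @param externalTable whether the table involved is external
     * @return whether this simplified model permits the optimization
     */
    static boolean isAllowed(Optimization opt, boolean guardEnabled, boolean externalTable) {
        if (!externalTable) {
            // Simplification: this model only restricts external tables.
            return true;
        }
        switch (opt) {
            case SMB_OR_BUCKET_JOIN:
            case RUNTIME_FILTERING:
                // Guarded by the property: allowed only when the guard is off.
                return !guardEnabled;
            case STATS_BASED_QUERY_ANSWERING:
                // Prohibited on external tables regardless of the property.
                return false;
            default:
                throw new IllegalArgumentException("unknown optimization: " + opt);
        }
    }

    public static void main(String[] args) {
        // Print the decision for an external table with the guard on and off.
        for (Optimization opt : Optimization.values()) {
            System.out.println(opt
                + " guard=on: " + isAllowed(opt, true, true)
                + ", guard=off: " + isAllowed(opt, false, true));
        }
    }
}
```

In this model the third case ignores the property entirely, which is why I do 
not see how the property guards stats-based decisions today.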

At this point, I am convinced by the proposal to lift the limitation on 
performing these optimizations on external tables. The discussion above is 
mainly meant to clarify the new purpose and effects of the existing property, 
assuming that we leave it in place.

 

> Enable semijoin reduction and map-join conversion on external tables with 
> accurate statistics
> ---------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29646
>                 URL: https://issues.apache.org/jira/browse/HIVE-29646
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Denys Kuzmenko
>            Priority: Major
>              Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
