[jira] [Work logged] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution

ASF GitHub Bot (Jira) Tue, 12 Apr 2022 07:37:07 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-26137?focusedWorklogId=755789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755789
 ]


ASF GitHub Bot logged work on HIVE-26137:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/22 14:36
            Start Date: 12/Apr/22 14:36
    Worklog Time Spent: 10m 
      Work Description: szlta opened a new pull request, #3203:
URL: https://github.com/apache/hive/pull/3203

   The filter expression attached to the file scan tasks is not actually a 
"residual" expression but the original data filter. This works in our favor: 
for any given Hive job the expression is the same object, so we can transfer 
it to the Hive execution processes in a different way.
   
   The expression itself is generated by 
https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java#L82-L93
 before split generation within the AM. Nothing prevents us from reusing the 
same logic on the executors.
   At the same time we can call ignoreResiduals() on the table scan, so that 
Iceberg uses the filter only for split generation and does not attach it to 
the file scan tasks, and therefore not to the splits that wrap them either. On 
the execution side we can retrieve the original filter expression with the 
logic above and evaluate it against the current task (whose spec and partition 
value information are available anyway), which yields the actual residual 
expression for the task. This is then passed to the underlying file formats 
the same way as before.
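
   To illustrate the last step, here is a minimal sketch of how an original 
filter can turn into a per-task residual once it is evaluated against that 
task's partition values. It is a toy model under stated assumptions: the 
filter is a plain conjunction of "column = value" terms, partitioning is by 
identity, and the names ResidualSketch and residualFor are hypothetical. It 
does not use the Iceberg or Hive classes that the actual change relies on.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: hypothetical stand-in, not an Iceberg or Hive class.
final class ResidualSketch {

    /**
     * Evaluates the original filter (column -> required value, all terms ANDed)
     * against one task's partition values (partition column -> value).
     *
     * Returns Optional.empty() when the partition contradicts the filter, so no
     * row of the task can match. Otherwise returns the residual: the terms that
     * still have to be checked row by row. An empty map means every row matches.
     * Null partition values are not modelled in this sketch.
     */
    static Optional<Map<String, String>> residualFor(
            Map<String, String> originalFilter, Map<String, String> partition) {
        Map<String, String> residual = new LinkedHashMap<>();
        for (Map.Entry<String, String> term : originalFilter.entrySet()) {
            String partitionValue = partition.get(term.getKey());
            if (partitionValue == null) {
                // Not a partition column: the term stays in the residual.
                residual.put(term.getKey(), term.getValue());
            } else if (!partitionValue.equals(term.getValue())) {
                // The partition value rules the whole conjunction out.
                return Optional.empty();
            }
            // Otherwise the partition already satisfies the term, so it is dropped.
        }
        return Optional.of(residual);
    }
}
```

   In this model the same original filter, for example dt = 2022-04-12 AND 
status = OK, reduces to status = OK for a task in partition dt=2022-04-12 and 
to "no rows" for a task in any other dt partition, which is why the residual 
can be derived on the executor instead of being shipped with each split.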




Issue Time Tracking
-------------------

            Worklog Id:     (was: 755789)
    Remaining Estimate: 0h
            Time Spent: 10m

> Optimized transfer of Iceberg residual expressions from AM to execution
> -----------------------------------------------------------------------
>
>                 Key: HIVE-26137
>                 URL: https://issues.apache.org/jira/browse/HIVE-26137
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ádám Szita
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> HIVE-25967 introduced a hack to prevent Iceberg filter expressions from being 
> serialized into splits. This temporary fix avoided OOM problems on the Tez 
> AM side, but it also prevented predicate pushdown from working on the 
> execution side.
> This ticket is intended to deliver the long-term solution. It turns out that 
> the file scan tasks created by Iceberg don't actually contain a "residual" 
> expression, but rather the complete, original one. It becomes residual only 
> when it is evaluated against the task's partition value, which only happens 
> on the execution side. This means that the original filter is the same 
> expression for all splits in the Tez AM, so we can transfer it via job conf 
> instead.
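
The difference between the two transfer strategies described in the ticket can 
be sketched with simple arithmetic. This is only an illustration: the class and 
method names are hypothetical, the filter is represented as an already 
serialized string, and the real serialized form and sizes in Hive are not shown 
here.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical helper, not part of Hive or Iceberg.
final class FilterTransferSketch {

    /** Size in bytes of one serialized copy of the filter. */
    static long filterBytes(String serializedFilter) {
        return serializedFilter.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Filter bytes held by the AM when every split carries its own copy. */
    static long bytesWhenAttachedToSplits(String serializedFilter, int splitCount) {
        return filterBytes(serializedFilter) * splitCount;
    }

    /** Filter bytes held when the filter is stored once in the job conf. */
    static long bytesViaJobConf(String serializedFilter) {
        return filterBytes(serializedFilter);
    }
}
```

With the per-split approach the filter payload grows linearly with the number 
of splits, while the job conf approach keeps it constant because the 
expression is identical for all splits.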



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
