[
https://issues.apache.org/jira/browse/HIVE-26137?focusedWorklogId=755789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755789
]
ASF GitHub Bot logged work on HIVE-26137:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 12/Apr/22 14:36
Start Date: 12/Apr/22 14:36
Worklog Time Spent: 10m
Work Description: szlta opened a new pull request, #3203:
URL: https://github.com/apache/hive/pull/3203
The filter expression that goes with the file scan tasks is actually not a
"residual" one, but rather the original data filter. This works in our favor:
for any given Hive job the expression is the same object for every task, so we
can transfer it to the Hive execution processes in another way:
The expression itself is generated via
https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java#L82-L93
before split generation within the AM. There's nothing to prevent us from
reusing this same logic on the executors.
At the same time we can call ignoreResiduals() on the table scan, so that
Iceberg uses the filter only for split generation and does not attach it to
the file scan tasks, and therefore not to the splits that wrap them either. On
the execution side we can simply retrieve the original filter expression with
the logic above and evaluate it against the current task (whose spec and
partition value information are present anyway), ending up with the actual
residual expression for the task. This is then passed to the underlying file
formats the same way as before.
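The reduction from original filter to per-task residual can be pictured with
a small self-contained sketch. It deliberately does not use the Iceberg API:
the class ResidualSketch, the method residualFor and the representation of a
filter as a map of "column must equal value" predicates are all hypothetical
stand-ins, chosen only to show why the original filter plus the task's
partition values are enough to rebuild the residual on the executor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResidualSketch {

    /**
     * Hypothetical model: the filter is a conjunction of "column == value"
     * predicates, and a task is described by its partition column values.
     * Returns null when the task cannot match the filter at all; otherwise
     * returns the predicates that still have to be checked row by row
     * (an empty map means every row of the task matches).
     */
    public static Map<String, String> residualFor(Map<String, String> filter,
                                                  Map<String, String> partition) {
        Map<String, String> residual = new LinkedHashMap<>();
        for (Map.Entry<String, String> predicate : filter.entrySet()) {
            String partitionValue = partition.get(predicate.getKey());
            if (partitionValue == null) {
                // Not a partition column: keep the predicate for the file reader.
                residual.put(predicate.getKey(), predicate.getValue());
            } else if (!partitionValue.equals(predicate.getValue())) {
                // The partition value contradicts the filter: nothing can match.
                return null;
            }
            // Otherwise the partition already satisfies the predicate, so drop it.
        }
        return residual;
    }

    public static void main(String[] args) {
        // The same original filter is shared by every task of the job.
        Map<String, String> filter = new LinkedHashMap<>();
        filter.put("dt", "2022-04-12");
        filter.put("status", "OPEN");

        // Each executor combines it with its own task's partition values.
        System.out.println(residualFor(filter, Map.of("dt", "2022-04-12")));
        System.out.println(residualFor(filter, Map.of("dt", "2022-04-11")));
    }
}
```

In this toy model the first task only has to check the status column while
the second can be skipped entirely, although both started from the identical
original filter.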
Issue Time Tracking
-------------------
Worklog Id: (was: 755789)
Remaining Estimate: 0h
Time Spent: 10m
> Optimized transfer of Iceberg residual expressions from AM to execution
> -----------------------------------------------------------------------
>
> Key: HIVE-26137
> URL: https://issues.apache.org/jira/browse/HIVE-26137
> Project: Hive
> Issue Type: Improvement
> Reporter: Ádám Szita
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> HIVE-25967 introduced a hack to prevent Iceberg filter expressions from being
> serialized into splits. This temporary fix was meant to avoid OOM problems on
> the Tez AM side, but at the same time it prevented predicate pushdown from
> working on the execution side.
> This ticket intends to incorporate the long-term solution. It turns out that
> the file scan tasks created by Iceberg don't actually contain a "residual"
> expression, but rather the complete, original one. It becomes residual only
> when it is evaluated against the task's partition value, which only happens
> on the execution side. This means that the original filter is the same
> expression for all splits in the Tez AM, so we can transfer it via the job
> conf instead.
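The size argument behind moving the filter into the job conf can be sketched
as follows. This is only an illustration under stated assumptions: the class
FilterViaConfSketch, the key name sketch.iceberg.filter, the string form of
the filter and the use of a plain Map in place of the real job configuration
are all hypothetical, and character counts stand in for serialized size.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilterViaConfSketch {

    // Hypothetical key; the property actually used by Hive is not stated here.
    public static final String FILTER_KEY = "sketch.iceberg.filter";

    /** Characters shipped when every split carries its own copy of the filter. */
    public static long charsWithFilterPerSplit(List<String> splitPaths, String filter) {
        long total = 0;
        for (String path : splitPaths) {
            total += path.length() + filter.length();
        }
        return total;
    }

    /** Characters shipped when the filter is stored once in the job conf. */
    public static long charsWithFilterInConf(List<String> splitPaths, Map<String, String> conf) {
        long total = conf.get(FILTER_KEY).length();
        for (String path : splitPaths) {
            total += path.length();
        }
        return total;
    }

    public static void main(String[] args) {
        String filter = "dt = '2022-04-12' AND status = 'OPEN'";
        List<String> splits = List.of("f1.orc", "f2.orc", "f3.orc");

        // The filter is written once; the splits no longer need to carry it.
        Map<String, String> conf = new HashMap<>();
        conf.put(FILTER_KEY, filter);

        System.out.println(charsWithFilterPerSplit(splits, filter));
        System.out.println(charsWithFilterInConf(splits, conf));
    }
}
```

With n splits the per-split variant carries n copies of the filter and the
conf variant carries one, so the saving grows with both the number of splits
and the size of the expression.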
--
This message was sent by Atlassian Jira
(v8.20.1#820001)