szlta commented on PR #4512:
URL: https://github.com/apache/iceberg/pull/4512#issuecomment-1096818564

   I'd like to propose another way to solve this for Hive. (I'm not sure 
whether this is an issue for other engines, or whether they even serialize 
tasks the same way a Tez AM does in the Hive ecosystem.)
   
   As @rdblue mentioned, the expression that goes with the file scan tasks is 
actually not a "residual" one, but rather the original data filter. This is 
good for us, because it means that for any Hive job the expression is the same 
for every task, so we can transfer it to the Hive execution processes another 
way:
   
   - The expression itself is generated via 
https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java#L82-L93
 before split generation within the AM. There's nothing to prevent us from 
reusing this same logic on the executors.
   - At the same time we can call ignoreResiduals() on the table scan, so that 
Iceberg only uses the filter for split generation, but won't actually attach it 
to the file scan tasks, and therefore not to the splits that wrap them either.
   - On the execution side we can retrieve the original filter expression 
using the logic above and evaluate it against the current task (whose spec and 
partition values are available anyway), which yields the actual residual 
expression for that task. This is then passed to the underlying file formats 
the same way as before.
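   
   To illustrate the last bullet, here is a minimal toy sketch of deriving a 
per-task residual from the original filter plus the task's partition values. 
It does not use any Iceberg classes: ResidualSketch, Eq and residualFor are 
hypothetical names, the filter is assumed to be a plain conjunction of 
equality predicates, and partitioning is assumed to be identity-based. In 
Iceberg itself this role is played, as far as I know, by 
ResidualEvaluator.of(spec, expr, caseSensitive).residualFor(partition).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are Iceberg classes.
public class ResidualSketch {

    /** Hypothetical predicate of the form "column = value". */
    static final class Eq {
        final String column;
        final Object value;

        Eq(String column, Object value) {
            this.column = column;
            this.value = value;
        }

        @Override
        public String toString() {
            return column + " = " + value;
        }
    }

    /**
     * Computes the residual of a conjunction of equality predicates for one
     * task. Returns null if the filter can never match the task's partition,
     * otherwise the predicates that still have to be checked row by row.
     */
    static List<Eq> residualFor(List<Eq> filter, Map<String, Object> partition) {
        List<Eq> residual = new ArrayList<>();
        for (Eq p : filter) {
            if (partition.containsKey(p.column)) {
                if (!p.value.equals(partition.get(p.column))) {
                    return null; // contradicts the partition value: no row can match
                }
                // Otherwise the predicate holds for every row of this task; drop it.
            } else {
                residual.add(p); // not decidable from partition data; keep it
            }
        }
        return residual;
    }

    public static void main(String[] args) {
        // The same original filter is rebuilt on every executor.
        List<Eq> filter = List.of(new Eq("dt", "2022-04-12"), new Eq("id", 42));

        // Task in the matching partition: only the non-partition predicate remains.
        System.out.println(residualFor(filter, Map.of("dt", "2022-04-12"))); // [id = 42]

        // Task in another partition: nothing can match.
        System.out.println(residualFor(filter, Map.of("dt", "2022-04-11"))); // null
    }
}
```

   The point of the sketch is that the residual depends only on the shared 
filter and on information the task already carries, so it does not need to be 
serialized with each split.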
   
   I've opened a PR for this in the Hive repo for now: 
https://github.com/apache/hive/pull/3203
   It looks like some larger refactoring will be required to separate the MR- 
and Hive-related code in the MR module of the Iceberg repo before this 
solution can be ported here too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

