[
https://issues.apache.org/jira/browse/IMPALA-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy resolved IMPALA-12477.
----------------------------------------
Resolution: Won't Fix
After further analysis, there is actually no difference between the for-loop
and forEach().
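
To illustrate the conclusion above, here is a minimal JDK-only sketch (it is
not Impala or Iceberg code). ForEachVsLoop, files(), withForLoop() and
withForEach() are hypothetical names made up for this example. It relies on
the documented default of Iterable.forEach(), which is a plain loop over
iterator(): unless a class overrides it, both forms visit the same elements in
the same order on the calling thread. Whether the iterable returned by
planFiles() overrides forEach() is not stated in this issue, so this sketch
assumes it does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ForEachVsLoop {
    // Hypothetical stand-in for a scan result. The method reference yields an
    // Iterable that only defines iterator(), so it inherits the default forEach().
    static Iterable<String> files() {
        List<String> names = Arrays.asList("a.parquet", "b.parquet", "c.parquet");
        return names::iterator;
    }

    // Records "<thread name>:<item>" for each element using an enhanced for-loop.
    static List<String> withForLoop(Iterable<String> items) {
        List<String> seen = new ArrayList<>();
        for (String item : items) {
            seen.add(Thread.currentThread().getName() + ":" + item);
        }
        return seen;
    }

    // Records the same information using Iterable.forEach().
    static List<String> withForEach(Iterable<String> items) {
        List<String> seen = new ArrayList<>();
        items.forEach(item -> seen.add(Thread.currentThread().getName() + ":" + item));
        return seen;
    }

    public static void main(String[] args) {
        List<String> viaLoop = withForLoop(files());
        List<String> viaForEach = withForEach(files());
        // Same elements, same order, same (calling) thread in both cases.
        System.out.println(viaLoop.equals(viaForEach)); // prints true
    }
}
```

Under that assumption, any parallelism would have to come from the iterator
that both forms consume, not from the choice of loop syntax.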
> Make Iceberg planFiles() use multiple threads
> ---------------------------------------------
>
> Key: IMPALA-12477
> URL: https://issues.apache.org/jira/browse/IMPALA-12477
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog, Frontend
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg, performance
>
> Impala is not using Iceberg’s planFiles() API in a performant way:
> https://github.com/apache/impala/blob/2d3289027c2ffdd245d13b60e6fa3f9b3e7bf833/fe/src/main/java/org/apache/impala/catalog/iceberg/GroupedContentFiles.java#L46
> Instead of a for-loop we should use a forEach() like Hive does:
> https://github.com/apache/hive/blob/071b721d8d73cc4d5d2d9469d7953bdc75ff615f/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L222
> The forEach() spreads the work across multiple threads.
> This will not only improve table loading times but also speed up queries
> that use planFiles(), e.g. queries that push down predicates to Iceberg and
> time-travel queries.