Zoltán Borók-Nagy created IMPALA-12477:
------------------------------------------
Summary: Make Iceberg planFiles() use multiple threads
Key: IMPALA-12477
URL: https://issues.apache.org/jira/browse/IMPALA-12477
Project: IMPALA
Issue Type: Bug
Components: Catalog, Frontend
Reporter: Zoltán Borók-Nagy
Impala is not using Iceberg’s planFiles() API in a performant way:
[https://github.com/apache/impala/blob/2d3289027c2ffdd245d13b60e6fa3f9b3e7bf833/fe/[…]java/org/apache/impala/catalog/iceberg/GroupedContentFiles.java|https://github.com/apache/impala/blob/2d3289027c2ffdd245d13b60e6fa3f9b3e7bf833/fe/src/main/java/org/apache/impala/catalog/iceberg/GroupedContentFiles.java#L46]
Instead of a for-loop, we should use forEach(), as Hive does:
[https://github.com/apache/hive/blob/071b721d8d73cc4d5d2d9469d7953bdc75ff615f/icebe[…]in/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java|https://github.com/apache/hive/blob/071b721d8d73cc4d5d2d9469d7953bdc75ff615f/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L222]
The forEach() spreads the work across multiple threads.
This will improve not only table loading times but also queries that
use planFiles(), e.g. queries that push down predicates to Iceberg and
time-travel queries.
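
One practical consequence of switching to a forEach() whose callback may run on several threads is that the code collecting the results has to be thread-safe. The following is only a minimal, self-contained sketch of that idea. It does not use the Iceberg API and is not Impala's GroupedContentFiles: PlannedFile, Grouped, sampleFiles() and group() are hypothetical names, and parallelStream() is an assumed stand-in for whatever parallel iteration the real scan provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ParallelGroupingSketch {
    /** Hypothetical stand-in for a planned file; not an Iceberg or Impala type. */
    static final class PlannedFile {
        final String path;
        final boolean hasDeletes;

        PlannedFile(String path, boolean hasDeletes) {
            this.path = path;
            this.hasDeletes = hasDeletes;
        }
    }

    /** Buckets that tolerate concurrent add() calls from several threads. */
    static final class Grouped {
        final Queue<PlannedFile> withoutDeletes = new ConcurrentLinkedQueue<>();
        final Queue<PlannedFile> withDeletes = new ConcurrentLinkedQueue<>();

        void add(PlannedFile file) {
            if (file.hasDeletes) {
                withDeletes.add(file);
            } else {
                withoutDeletes.add(file);
            }
        }
    }

    /** Builds n sample files; every fourth one is marked as having deletes. */
    static List<PlannedFile> sampleFiles(int n) {
        List<PlannedFile> files = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            files.add(new PlannedFile("data-" + i + ".parquet", i % 4 == 0));
        }
        return files;
    }

    /** Groups files with a forEach() whose callback may run on multiple threads. */
    static Grouped group(List<PlannedFile> files) {
        Grouped grouped = new Grouped();
        // Assumption: parallelStream() stands in for the parallel scan iteration.
        files.parallelStream().forEach(grouped::add);
        return grouped;
    }

    public static void main(String[] args) {
        Grouped grouped = group(sampleFiles(1000));
        // Prints: withoutDeletes=750 withDeletes=250
        System.out.println("withoutDeletes=" + grouped.withoutDeletes.size()
            + " withDeletes=" + grouped.withDeletes.size());
    }
}
```

With plain ArrayList buckets the same forEach() could lose elements or throw under concurrent writes, so any real change along these lines would need to check how the grouped lists are populated.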
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
