Zoltán Borók-Nagy created IMPALA-12477:
------------------------------------------

             Summary: Make Iceberg planFiles() use multiple threads
                 Key: IMPALA-12477
                 URL: https://issues.apache.org/jira/browse/IMPALA-12477
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog, Frontend
            Reporter: Zoltán Borók-Nagy


Impala is not using Iceberg’s planFiles() API in a performant way:
[https://github.com/apache/impala/blob/2d3289027c2ffdd245d13b60e6fa3f9b3e7bf833/fe/[…]java/org/apache/impala/catalog/iceberg/GroupedContentFiles.java|https://github.com/apache/impala/blob/2d3289027c2ffdd245d13b60e6fa3f9b3e7bf833/fe/src/main/java/org/apache/impala/catalog/iceberg/GroupedContentFiles.java#L46]


Instead of a for-loop, we should use forEach(), as Hive does:
[https://github.com/apache/hive/blob/071b721d8d73cc4d5d2d9469d7953bdc75ff615f/icebe[…]in/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java|https://github.com/apache/hive/blob/071b721d8d73cc4d5d2d9469d7953bdc75ff615f/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L222]

The forEach() spreads the work across multiple threads.
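
To illustrate the difference, here is a minimal sketch that uses only the JDK, not the Iceberg API. The class ParallelCollectSketch, the describe() helper, and the integer "file ids" are hypothetical stand-ins for the per-file work done while consuming planFiles(). A parallel stream plays the role of an iterable whose forEach() may run the callback on several threads. It is not a patch for GroupedContentFiles.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.IntStream;

class ParallelCollectSketch {
    // Hypothetical stand-in for the work done per planned file.
    static String describe(int fileId) {
        return "file-" + fileId;
    }

    // Plain for-loop: every element is processed on the calling thread.
    static List<String> collectSequential(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(describe(i));
        }
        return out;
    }

    // forEach() on a parallel source: the callback may run on several
    // threads at once, so the sink has to be thread-safe.
    static Queue<String> collectParallel(int n) {
        Queue<String> out = new ConcurrentLinkedQueue<>();
        IntStream.range(0, n).parallel().forEach(i -> out.add(describe(i)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectSequential(1000).size());
        System.out.println(collectParallel(1000).size());
    }
}
```

Two things would likely carry over to the real change: the collection that receives the files must tolerate concurrent writes, and the order of the results is no longer guaranteed.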

This will improve not only table loading times, but also queries that
use planFiles(), e.g. queries that push down predicates to Iceberg and
time-travel queries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
