[jira] [Commented] (IMPALA-12298) Improve incremental load of Iceberg tables

ASF subversion and git services (Jira) Tue, 01 Aug 2023 06:48:05 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749765#comment-17749765
 ]


ASF subversion and git services commented on IMPALA-12298:
----------------------------------------------------------

Commit c8422136962b8d08e5f44d8351fb4fe7cdb675b8 in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c84221369 ]

IMPALA-12298: Improve incremental load of Iceberg tables

Currently Impala reloads the whole table with all its metadata
when a table is updated, even if no files were modified or only
a few files were added. This hurts performance for large tables,
especially when Hadoop RPC encryption is enabled. See HADOOP-14558 and
HADOOP-10768 for details.

This patch adds an optimization to only load the newly added files
if their number is under a threshold. The threshold can be set by
the backend flag 'iceberg_reload_new_files_threshold' (100 by default).
If there are more new files than the threshold, we fall back to the old
behavior.
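
A minimal sketch of the decision described above, written in Python for
brevity. It is not Impala's actual implementation: FileDescriptor,
reload_file_descriptors, load_one and load_all are hypothetical names, and
treating "exactly at the threshold" as still incremental is an assumption.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the backend flag
# 'iceberg_reload_new_files_threshold' (100 by default).
ICEBERG_RELOAD_NEW_FILES_THRESHOLD = 100


@dataclass(frozen=True)
class FileDescriptor:
    """Hypothetical, simplified descriptor (real ones carry block locations)."""
    path: str


def reload_file_descriptors(cached, current_paths, load_one, load_all,
                            threshold=ICEBERG_RELOAD_NEW_FILES_THRESHOLD):
    """Return descriptors for current_paths, reusing cached ones when possible.

    cached:        dict mapping path -> FileDescriptor from the previous load
    current_paths: iterable of data file paths in the current table state
    load_one:      callable(path) -> FileDescriptor, one storage call per file
    load_all:      callable(paths) -> dict of path -> FileDescriptor (full reload)
    """
    current = list(current_paths)
    new_paths = [p for p in current if p not in cached]

    # Too many new files: fall back to the old behavior and reload everything.
    if len(new_paths) > threshold:
        return load_all(current)

    # Otherwise keep the descriptors of unchanged files and load only new ones.
    # Files no longer present in current_paths are dropped implicitly.
    result = {p: cached[p] for p in current if p in cached}
    for p in new_paths:
        result[p] = load_one(p)
    return result
```

With this shape, an update that adds a handful of files costs one load_one
call per new file, while a large change still goes through a single full
reload.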

Testing:
 * added unit test
 * manually checked the TRACE logs of IcebergFileMetadataLoader

Change-Id: Icf643798a93e74ae7b0f37ceeab0a8052fb2699d
Reviewed-on: http://gerrit.cloudera.org:8080/20271
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Improve incremental load of Iceberg tables
> ------------------------------------------
>
>                 Key: IMPALA-12298
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12298
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg, performance
>
> *The following mostly affects HDFS/Ozone, where we need to contact the 
> NameNode to create file descriptors with block locations. On cloud object 
> stores, where there are no block locations, we only need the Iceberg metadata 
> to create the file descriptors.*
> Currently we always reload all the metadata belonging to an Iceberg table.
> This means we recreate all the file descriptors even if only a few of them 
> have changed.
> We could check the number of newly added files, and if there are only a few 
> of them, load the file descriptors for just those, one by one.
> We can fall back to a full reload if a significant number of files have 
> changed, i.e. when it is better to use a recursive file listing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

