[ https://issues.apache.org/jira/browse/IMPALA-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749765#comment-17749765 ]
ASF subversion and git services commented on IMPALA-12298:
----------------------------------------------------------
Commit c8422136962b8d08e5f44d8351fb4fe7cdb675b8 in impala's branch
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c84221369 ]
IMPALA-12298: Improve incremental load of Iceberg tables
Currently Impala reloads the whole table with all its metadata
when a table is updated, even if no files were modified or
only a few files were added. This hurts performance for large tables,
especially when Hadoop RPC encryption is enabled. See HADOOP-14558 and
HADOOP-10768 for details.
This patch adds an optimization to only load the newly added files
if their number is under a threshold. The threshold can be set by
the backend flag 'iceberg_reload_new_files_threshold' (100 by default).
If there are more files than the threshold, we fall back to the old
behavior.
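The decision described above can be sketched roughly as follows. This is
only an illustration, not the code in the patch: the function name, the
load_one/load_all callbacks and the dict-based cache are hypothetical, and
it assumes that a count exactly equal to the threshold still takes the
incremental path.

```python
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for the backend flag
# 'iceberg_reload_new_files_threshold' (100 by default per the commit message).
RELOAD_NEW_FILES_THRESHOLD = 100


def reload_file_descriptors(
    cached: Dict[str, str],
    current_paths: Iterable[str],
    load_one: Callable[[str], str],
    load_all: Callable[[Iterable[str]], Dict[str, str]],
    threshold: int = RELOAD_NEW_FILES_THRESHOLD,
) -> Dict[str, str]:
    """Return file descriptors for current_paths, reusing cached ones if cheap.

    cached maps file path -> descriptor from the previous load.
    load_one and load_all are assumed callbacks for a per-file lookup
    and a full (recursive-listing style) reload respectively.
    """
    current = set(current_paths)
    new_paths = current - set(cached)

    # Too many new files: fall back to the old behavior (full reload).
    if len(new_paths) > threshold:
        return load_all(current)

    # Keep descriptors of files still in the table; removed files are dropped.
    result = {path: fd for path, fd in cached.items() if path in current}
    # Load descriptors only for the newly added files, one by one.
    for path in sorted(new_paths):
        result[path] = load_one(path)
    return result
```

With an unchanged file list, new_paths is empty and no loader is called at
all, which is the "no files modified" case mentioned above.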
Testing:
* added unit test
* manually checked the TRACE logs of IcebergFileMetadataLoader
Change-Id: Icf643798a93e74ae7b0f37ceeab0a8052fb2699d
Reviewed-on: http://gerrit.cloudera.org:8080/20271
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Improve incremental load of Iceberg tables
> ------------------------------------------
>
> Key: IMPALA-12298
> URL: https://issues.apache.org/jira/browse/IMPALA-12298
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg, performance
>
> *The following mostly affects HDFS/Ozone, where we need to contact the
> NameNode to create file descriptors with block locations. On cloud object
> stores, where there are no block locations, we only need the Iceberg metadata
> to create the file descriptors.*
> Currently we always reload all the metadata belonging to an Iceberg table.
> This means we recreate all the file descriptors even if only a few of them
> have changed.
> We could check the number of newly added files, and if there are only a few
> of them, load the file descriptors for just those files one by one.
> We can fall back to a full reload if a significant number of files have
> changed, i.e. when it is better to use a recursive file listing.