Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/21608 )
Change subject: IMPALA-13254: Optimize REFRESH for Iceberg tables ...................................................................... Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/21608/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21608/11//COMMIT_MSG@11 PS11, Line 11: 1. In the IcebergFileMetadataLoader.getFileStatuses function, mostly : call parallelListing to quickly retrieve FileStatus, avoiding : the sequential fetching for each contentFile. > Please share your measurements for a large table with lots of partitions an Please resolve this comment. http://gerrit.cloudera.org:8080/#/c/21608/13/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java File fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java: http://gerrit.cloudera.org:8080/#/c/21608/13/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@243 PS13, Line 243: // and specify the listWithLocations parameter as false : nameToFileStatus = parallelListing(fs, : filesSupportsStorageIds, filesSupportsStorageIds.size(), false); : for (ContentFile<?> contentFile : filesSupportsStorageIds) { : Path path = FileSystemUtil.createFullyQualifiedPath( : new Path(contentFile.path().toString())); : FileStatus status = nameToFileStatus.get(path); : stats.add(Preconditions.checkNotNull(status)); : } : r In the old code we created the located file statuses in the for-loop, then we could use them in FileMetadataLoader: https://github.com/apache/impala/blob/af5f3acd6708df623cd2404b3616fe1985c26aab/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L302 Here we are creating an extra listing without location info, then we will need to get the block locations at https://github.com/apache/impala/blob/af5f3acd6708df623cd2404b3616fe1985c26aab/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L304 This is all happening in a single thread, so I'd assume we are slower now, as we are doing extra file listings. Please add measurements if it is not the case. What we could do here instead is to parallelize Utils.createLocatedFileStatus() calls for the elements in filesSupportsStorageIds. -- To view, visit http://gerrit.cloudera.org:8080/21608 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8c99a28eb16275efdff52e0ea2711c0c6036719c Gerrit-Change-Number: 21608 Gerrit-PatchSet: 11 Gerrit-Owner: Fu Lili <[email protected]> Gerrit-Reviewer: Fu Lili <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 15 Aug 2024 10:40:19 +0000 Gerrit-HasComments: Yes
