Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21608 )

Change subject: IMPALA-13254: Optimize REFRESH for Iceberg tables
......................................................................


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/21608/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21608/11//COMMIT_MSG@11
PS11, Line 11: 1. In the IcebergFileMetadataLoader.getFileStatuses function, 
mostly
             :   call parallelListing to quickly retrieve FileStatus, avoiding
             :   the sequential fetching for each contentFile.
> Please share your measurements for a large table with lots of partitions an
Please resolve this comment.


http://gerrit.cloudera.org:8080/#/c/21608/13/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/21608/13/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@243
PS13, Line 243:     // and specify the listWithLocations parameter as false
              :     nameToFileStatus = parallelListing(fs,
              :         filesSupportsStorageIds, 
filesSupportsStorageIds.size(), false);
              :     for (ContentFile<?> contentFile : filesSupportsStorageIds) {
              :       Path path = FileSystemUtil.createFullyQualifiedPath(
              :           new Path(contentFile.path().toString()));
              :       FileStatus status = nameToFileStatus.get(path);
              :       stats.add(Preconditions.checkNotNull(status));
              :     }
              :     r
In the old code we created the located file statuses in the for-loop, then we 
could use them in FileMetadataLoader:
https://github.com/apache/impala/blob/af5f3acd6708df623cd2404b3616fe1985c26aab/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L302

Here we are creating an extra listing without location info, then we will need 
to get the block locations at 
https://github.com/apache/impala/blob/af5f3acd6708df623cd2404b3616fe1985c26aab/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L304

This is all happening in a single thread, so I'd assume we are slower now, as 
we are doing extra file listings. Please add measurements if it is not the case.

What we could do here instead is to parallelize Utils.createLocatedFileStatus() 
calls for the elements in filesSupportsStorageIds.



--
To view, visit http://gerrit.cloudera.org:8080/21608
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c99a28eb16275efdff52e0ea2711c0c6036719c
Gerrit-Change-Number: 21608
Gerrit-PatchSet: 11
Gerrit-Owner: Fu Lili <[email protected]>
Gerrit-Reviewer: Fu Lili <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 15 Aug 2024 10:40:19 +0000
Gerrit-HasComments: Yes

Reply via email to