Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/18160 )
Change subject: IMPALA-11076: Re-use FDs loaded by HdfsTable during IcebergTable load ...................................................................... Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/18160/1/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java File fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java: http://gerrit.cloudera.org:8080/#/c/18160/1/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java@472 PS1, Line 472: /** : * Get FileDescriptor by data file location : */ : public static HdfsPartition.FileDescriptor getFileDescriptor(Path fileLoc, : Path tableLoc, ListMap<TNetworkAddress> hostIndex) throws IOException { : FileSystem fs = FileSystemUtil.getFileSystemForPath(tableLoc); : FileStatus fileStatus = fs.getFileStatus(fileLoc); : return getFileDescriptor(fs, tableLoc, fileStatus, hostIndex); : } : : private static HdfsPartition.FileDescriptor getFileDescriptor(FileSystem fs, : Path tableLoc, FileStatus fileStatus, ListMap<TNetworkAddress> hostIndex) : throws IOException { : Reference<Long> numUnknownDiskIds = new Reference<Long>(Long.valueOf(0)); : String relPath = FileSystemUtil.relativizePath(fileStatus.getPath(), tableLoc); : : if (!FileSystemUtil.supportsStorageIds(fs)) { : return HdfsPartition.FileDescriptor.createWithNoBlocks(fileStatus, relPath); : } : : BlockLocation[] locations; : if (fileStatus instanceof LocatedFileStatus) { : locations = ((LocatedFileStatus)fileStatus).getBlockLocations(); : } else { : locations = fs.getFileBlockLocations(fileStatus, 0, fileStatus.getLen()); : } : : return HdfsPartition.FileDescriptor.create(fileStatus, relPath, locations, : hostIndex, HdfsShim.isErasureCoded(fileStatus), numUnknownDiskIds); : } Becomes unused? Or maybe we can keep it as a fallback? E.g. generate a warning when we cannot find the file descriptor in the HDFS table, but load it anyway with these methods? http://gerrit.cloudera.org:8080/#/c/18160/1/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java@516 PS1, Line 516: fileDescMap We should only put entries to the fileDescMap if they are actually in the Iceberg data file list. E.g. if the table is compacted then the 'data' directory potentially contain much more files than the current snapshot. Also, later we'll need to decorate the file descriptors with information coming from the Iceberg data files (e.g. which partition they belong, their file format, etc.). So I think we should create a temporary map of the file descriptors in the HDFS table (only using paths under the 'data' directory) and use it as a lookup table to populate another map ('fileDescMap') that only contains the snapshot data files. -- To view, visit http://gerrit.cloudera.org:8080/18160 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id49daec60237dd6c8cce92c00a09fd9a6e6479b4 Gerrit-Change-Number: 18160 Gerrit-PatchSet: 1 Gerrit-Owner: Tamas Mate <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 20 Jan 2022 19:29:31 +0000 Gerrit-HasComments: Yes
