[ https://issues.apache.org/jira/browse/IMPALA-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltán Borók-Nagy reassigned IMPALA-14349:
------------------------------------------

    Assignee: Zoltán Borók-Nagy

> Encode FileDescriptors in time in loading Iceberg Tables
> --------------------------------------------------------
>
>                 Key: IMPALA-14349
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14349
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: iceberg
>
> When loading file metadata of an IcebergTable in
> IcebergFileMetadataLoader#loadInternal() -> parallelListing(), we maintain a
> map from paths to FileStatus objects:
> [https://github.com/apache/impala/blob/50926b5d8e941c5cc10fd77d0b4556e3441c41e7/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java#L171]
> This map consumes a lot of memory because the loaded FileStatus objects are
> of type HdfsLocatedFileStatus, and each of them retains about 6 KB of heap.
> E.g.
> {noformat}
> Class Name | Shallow Heap | Retained Heap
> ----------------------------------------------------------------------------------------------------
> org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1008511620 | 120 | 6,192
> |- <class> class org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1009e2a058 | 16 | 40
> |- isdir java.lang.Boolean @ 0x10056a7638 false | 16 | 16
> |- path org.apache.hadoop.fs.Path @ 0x1008511310 | 16 | 784
> |- permission org.apache.hadoop.hdfs.protocol.FsPermissionExtension @ 0x1008511698 | 32 | 32
> |- owner java.lang.String @ 0x10085116b8 id971832 | 24 | 48
> |- group java.lang.String @ 0x10085116e8 hive | 24 | 48
> |- attr java.util.RegularEnumSet @ 0x1008511718 | 32 | 32
> |- locations org.apache.hadoop.fs.BlockLocation[1] @ 0x1008511738 | 24 | 192
> |- uPath byte[62] @ 0x1008511838 00668-28396-9dd59fc9-3ed9-40ca-8f39-e68bd2724c14-00040.parquet | 80 | 80
> |- hdfsloc org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1008511888 | 40 | 5,576
> | |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1009e20278 | 8 | 512
> | |- blocks java.util.ArrayList @ 0x10085118b0 | 24 | 2,760
> | | |- <class> class java.util.ArrayList @ 0x100573da10 System Class | 32 | 240
> | | |- elementData java.lang.Object[1] @ 0x10085118c8 | 24 | 2,736
> | | | |- class java.lang.Object[] @ 0x1005fc4650 | 0 | 0
> | | | |- [0] org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x10085118e0 | 48 | 2,712
> | | | | |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
> | | | | |- storageIDs java.lang.String[3] @ 0x10085117f8 | 32 | 32
> | | | | |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008511818 | 32 | 32
> | | | | |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x1008511910 | 24 | 64
> | | | | |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x1008511950 | 32 | 2,456
> | | | | | |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
> | | | | | |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511970 | 200 | 808
> | | | | | |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511c98 | 200 | 808
> | | | | | |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511fc0 | 200 | 808
> | | | | | '- Total: 4 entries | |
> | | | | |- blockToken org.apache.hadoop.security.token.Token @ 0x10085122e8 | 32 | 144
> | | | | |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
> | | | | '- Total: 7 entries | |
> | | | '- Total: 2 entries | |
> | | '- Total: 2 entries | |
> | |- lastLocatedBlock org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1008512378 | 48 | 2,776
> | | |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
> | | |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x10085123a8 | 24 | 64
> | | |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x10085123e8 | 32 | 2,216
> | | | |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
> | | | |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512408 | 200 | 728
> | | | |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512730 | 200 | 728
> | | | |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512a58 | 200 | 728
> | | | | |- <class> class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x102005aee8 | 8 | 104
> | | | | |- ipAddr java.lang.String @ 0x1008512b20 xxx.xxx.xxx.xxx | 24 | 56
> | | | | |- ipAddrBytes com.google.protobuf.LiteralByteString @ 0x1008512b58 | 24 | 56
> | | | | |- hostName java.lang.String @ 0x1008512b90 www.abc.com | 24 | 56
> | | | | |- hostNameBytes com.google.protobuf.LiteralByteString @ 0x1008512bc8 | 24 | 56
> | | | | |- xferAddr java.lang.String @ 0x1008512c00 xxx.xxx.xxx.xxx:9866 | 24 | 64
> | | | | |- datanodeUuid java.lang.String @ 0x1008512c40 2f6e6e42-9347-4370-a318-79efdadcc3cf | 24 | 80
> | | | | |- datanodeUuidBytes com.google.protobuf.LiteralByteString @ 0x1008512c90 | 24 | 80
> | | | | |- location java.lang.String @ 0x1008512ce0 /default | 24 | 48
> | | | | |- dependentHostNames java.util.LinkedList @ 0x1008512d10 | 32 | 32
> | | | | |- storageID java.lang.String @ 0x1008512d30 DS-f190d2ef-755b-4f73-bb3d-67b6e72805e2 | 24 | 80
> | | | | |- adminState org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates @ 0x101b01ef50 NORMAL | 24 | 24
> | | | | |- storageType org.apache.hadoop.fs.StorageType @ 0x101b01f000 DISK | 24 | 24
> | | | | '- Total: 13 entries | |
> | | | '- Total: 4 entries | |
> | | |- storageIDs java.lang.String[3] @ 0x1008512d80 | 32 | 32
> | | |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008512da0 | 32 | 32
> | | |- blockToken org.apache.hadoop.security.token.Token @ 0x1008512dc0 | 32 | 144
> | | |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
> | | '- Total: 7 entries | |
> | '- Total: 3 entries | |
> '- Total: 10 entries
> {noformat}
> There are some duplicate strings, such as storageIDs and hostnames. We can
> invoke String.intern() on them to save some space. But it would be better to
> convert these FileStatus objects into IcebergFileDescriptor promptly to reduce
> the space usage. Encoding each IcebergFileDescriptor into bytes (which usually
> takes 200 bytes per file) promptly can save even more space.
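
To make the idea in the description concrete, here is a minimal sketch of encoding each listed file into a small byte array right away and sharing repeated host names through one table, instead of keeping a path -> FileStatus map. Everything in it is hypothetical and for illustration only: the class CompactFileEntrySketch, its methods, the fields it keeps, and the byte layout are assumptions, not Impala's IcebergFileDescriptor or its actual encoding (which, per the description, usually takes about 200 bytes per file because it carries more fields).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Impala code.
final class CompactFileEntrySketch {
    // Each distinct host name is stored once and referenced by index,
    // which removes the duplicated host strings seen in the heap dump.
    private final List<String> hosts = new ArrayList<>();
    private final Map<String, Integer> hostIndex = new HashMap<>();
    // Path -> encoded bytes, in place of a path -> FileStatus map.
    private final Map<String, byte[]> encodedByPath = new HashMap<>();

    private int indexOfHost(String host) {
        Integer idx = hostIndex.get(host);
        if (idx == null) {
            idx = hosts.size();
            hosts.add(host);
            hostIndex.put(host, idx);
        }
        return idx;
    }

    // Encodes one listed file immediately, so the caller can drop the
    // large status object as soon as this method returns.
    // Assumed layout: length (8 bytes), modification time (8 bytes),
    // replica count (1 byte, so fewer than 256 replicas), one int per replica host.
    void add(String path, long length, long modificationTime, String[] replicaHosts) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(length);
            out.writeLong(modificationTime);
            out.writeByte(replicaHosts.length);
            for (String host : replicaHosts) {
                out.writeInt(indexOfHost(host));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        encodedByPath.put(path, buf.toByteArray());
    }

    int encodedSize(String path) {
        return encodedByPath.get(path).length;
    }

    int distinctHosts() {
        return hosts.size();
    }

    // Decodes only the replica hosts of one entry.
    String[] replicaHosts(String path) {
        byte[] bytes = encodedByPath.get(path);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            in.readLong(); // skip length
            in.readLong(); // skip modification time
            int n = in.readUnsignedByte();
            String[] result = new String[n];
            for (int i = 0; i < n; i++) {
                result[i] = hosts.get(in.readInt());
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CompactFileEntrySketch sketch = new CompactFileEntrySketch();
        String[] replicas = {"dn1.example.com", "dn2.example.com", "dn3.example.com"};
        sketch.add("data/00040.parquet", 1024L, 1700000000000L, replicas);
        sketch.add("data/00041.parquet", 2048L, 1700000000001L, replicas);
        System.out.println("bytes per entry: " + sketch.encodedSize("data/00040.parquet"));
        System.out.println("distinct hosts: " + sketch.distinctHosts());
    }
}
```

The shared host table plays the role that String.intern() would play for the duplicated strings, while encoding at insertion time means only the small byte array stays reachable from the map.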