Quanlong Huang created IMPALA-14349:
---------------------------------------
Summary: Encode FileDescriptors in time in loading Iceberg Tables
Key: IMPALA-14349
URL: https://issues.apache.org/jira/browse/IMPALA-14349
Project: IMPALA
Issue Type: Improvement
Components: Catalog
Reporter: Quanlong Huang
When loading file metadata of an IcebergTable in
IcebergFileMetadataLoader#loadInternal() -> parallelListing(), we maintain a
map from paths to FileStatus objects:
[https://github.com/apache/impala/blob/50926b5d8e941c5cc10fd77d0b4556e3441c41e7/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java#L171]
This map consumes a lot of memory since the loaded FileStatus objects are actually HdfsLocatedFileStatus instances, and each of them retains around 6KB of heap. E.g.
{noformat}
Class Name | Shallow Heap | Retained Heap
----------------------------------------------------------------------------------------------------------------------------------------
org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1008511620 | 120 | 6,192
|- <class> class org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1009e2a058 | 16 | 40
|- isdir java.lang.Boolean @ 0x10056a7638 false | 16 | 16
|- path org.apache.hadoop.fs.Path @ 0x1008511310 | 16 | 784
|- permission org.apache.hadoop.hdfs.protocol.FsPermissionExtension @ 0x1008511698 | 32 | 32
|- owner java.lang.String @ 0x10085116b8 id971832 | 24 | 48
|- group java.lang.String @ 0x10085116e8 hive | 24 | 48
|- attr java.util.RegularEnumSet @ 0x1008511718 | 32 | 32
|- locations org.apache.hadoop.fs.BlockLocation[1] @ 0x1008511738 | 24 | 192
|- uPath byte[62] @ 0x1008511838 00668-28396-9dd59fc9-3ed9-40ca-8f39-e68bd2724c14-00040.parquet | 80 | 80
|- hdfsloc org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1008511888 | 40 | 5,576
| |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1009e20278 | 8 | 512
| |- blocks java.util.ArrayList @ 0x10085118b0 | 24 | 2,760
| | |- <class> class java.util.ArrayList @ 0x100573da10 System Class | 32 | 240
| | |- elementData java.lang.Object[1] @ 0x10085118c8 | 24 | 2,736
| | | |- class java.lang.Object[] @ 0x1005fc4650 | 0 | 0
| | | |- [0] org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x10085118e0 | 48 | 2,712
| | | | |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
| | | | |- storageIDs java.lang.String[3] @ 0x10085117f8 | 32 | 32
| | | | |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008511818 | 32 | 32
| | | | |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x1008511910 | 24 | 64
| | | | |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x1008511950 | 32 | 2,456
| | | | | |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
| | | | | |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511970 | 200 | 808
| | | | | |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511c98 | 200 | 808
| | | | | |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511fc0 | 200 | 808
| | | | | '- Total: 4 entries
| | | | |- blockToken org.apache.hadoop.security.token.Token @ 0x10085122e8 | 32 | 144
| | | | |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
| | | | '- Total: 7 entries
| | | '- Total: 2 entries
| | '- Total: 2 entries
| |- lastLocatedBlock org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1008512378 | 48 | 2,776
| | |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
| | |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x10085123a8 | 24 | 64
| | |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x10085123e8 | 32 | 2,216
| | | |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
| | | |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512408 | 200 | 728
| | | |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512730 | 200 | 728
| | | |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512a58 | 200 | 728
| | | | |- <class> class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x102005aee8 | 8 | 104
| | | | |- ipAddr java.lang.String @ 0x1008512b20 xxx.xxx.xxx.xxx | 24 | 56
| | | | |- ipAddrBytes com.google.protobuf.LiteralByteString @ 0x1008512b58 | 24 | 56
| | | | |- hostName java.lang.String @ 0x1008512b90 www.abc.com | 24 | 56
| | | | |- hostNameBytes com.google.protobuf.LiteralByteString @ 0x1008512bc8 | 24 | 56
| | | | |- xferAddr java.lang.String @ 0x1008512c00 xxx.xxx.xxx.xxx:9866 | 24 | 64
| | | | |- datanodeUuid java.lang.String @ 0x1008512c40 2f6e6e42-9347-4370-a318-79efdadcc3cf | 24 | 80
| | | | |- datanodeUuidBytes com.google.protobuf.LiteralByteString @ 0x1008512c90 | 24 | 80
| | | | |- location java.lang.String @ 0x1008512ce0 /default | 24 | 48
| | | | |- dependentHostNames java.util.LinkedList @ 0x1008512d10 | 32 | 32
| | | | |- storageID java.lang.String @ 0x1008512d30 DS-f190d2ef-755b-4f73-bb3d-67b6e72805e2 | 24 | 80
| | | | |- adminState org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates @ 0x101b01ef50 NORMAL | 24 | 24
| | | | |- storageType org.apache.hadoop.fs.StorageType @ 0x101b01f000 DISK | 24 | 24
| | | | '- Total: 13 entries
| | | '- Total: 4 entries
| | |- storageIDs java.lang.String[3] @ 0x1008512d80 | 32 | 32
| | |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008512da0 | 32 | 32
| | |- blockToken org.apache.hadoop.security.token.Token @ 0x1008512dc0 | 32 | 144
| | |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
| | '- Total: 7 entries
| '- Total: 3 entries
'- Total: 10 entries{noformat}
There are some duplicate strings, e.g. the storage IDs and hostnames, on which we can invoke String.intern() to save some space. But it'd be better to convert these FileStatus objects into IcebergFileDescriptor in time, i.e. as soon as each file is listed, instead of accumulating the heavyweight HdfsLocatedFileStatus objects in the map, to reduce the space usage. Encoding each IcebergFileDescriptor into bytes in time (which usually takes around 200 bytes per file) can save even more space.
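To illustrate the size difference, here is a minimal sketch in plain JDK. It is NOT Impala's actual FlatBuffer-based IcebergFileDescriptor encoding; the class name CompactFd, the field layout, and the idea of indexing block hosts into a shared host table are illustrative assumptions. It only shows that packing the few fields the catalog needs (relative path, file length, modification time, block host indices) yields tens of bytes per file, versus the ~6KB retained by one HdfsLocatedFileStatus above:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CompactFd {
    // Hypothetical compact layout:
    //   [u16 pathLen][path bytes][i64 length][i64 mtime][u16 numHosts][u16 hostIdx...]
    // Host indices point into a per-table host list shared by all files, so
    // hostnames and storage IDs are stored once instead of per block replica.
    static byte[] encode(String relPath, long length, long mtime, short[] hostIdx) {
        byte[] path = relPath.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + path.length + 8 + 8 + 2 + 2 * hostIdx.length);
        buf.putShort((short) path.length);
        buf.put(path);
        buf.putLong(length);
        buf.putLong(mtime);
        buf.putShort((short) hostIdx.length);
        for (short h : hostIdx) buf.putShort(h);
        return buf.array();
    }

    public static void main(String[] args) {
        // Hypothetical 128MB data file with 3 block replicas.
        byte[] fd = encode("data-00040.parquet", 134_217_728L, 1_700_000_000_000L,
                new short[]{0, 1, 2});
        System.out.println(fd.length + " bytes"); // prints "44 bytes"
    }
}
```

The real IcebergFileDescriptor carries more than this sketch (per-block offsets, disk IDs, caching flags, etc.), which is why the issue cites ~200 bytes per file rather than ~44, but the ratio against a 6KB HdfsLocatedFileStatus holds either way.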
--
This message was sent by Atlassian Jira
(v8.20.10#820010)