Quanlong Huang created IMPALA-14349:
---------------------------------------

             Summary: Encode FileDescriptors in time in loading Iceberg Tables
                 Key: IMPALA-14349
                 URL: https://issues.apache.org/jira/browse/IMPALA-14349
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
            Reporter: Quanlong Huang


When loading file metadata of an IcebergTable in 
IcebergFileMetadataLoader#loadInternal() -> parallelListing(), we maintain a 
map from paths to FileStatus objects:

[https://github.com/apache/impala/blob/50926b5d8e941c5cc10fd77d0b4556e3441c41e7/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java#L171]

This map consumes a lot of memory because the loaded FileStatus objects are of 
type HdfsLocatedFileStatus, and each of them retains about 6 KB of heap. E.g.
{noformat}
Class Name | Shallow Heap | Retained Heap
--------------------------------------------------------------------------------
org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1008511620 | 120 | 6,192
|- <class> class org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1009e2a058 | 16 | 40
|- isdir java.lang.Boolean @ 0x10056a7638  false | 16 | 16
|- path org.apache.hadoop.fs.Path @ 0x1008511310 | 16 | 784
|- permission org.apache.hadoop.hdfs.protocol.FsPermissionExtension @ 0x1008511698 | 32 | 32
|- owner java.lang.String @ 0x10085116b8  id971832 | 24 | 48
|- group java.lang.String @ 0x10085116e8  hive | 24 | 48
|- attr java.util.RegularEnumSet @ 0x1008511718 | 32 | 32
|- locations org.apache.hadoop.fs.BlockLocation[1] @ 0x1008511738 | 24 | 192
|- uPath byte[62] @ 0x1008511838  00668-28396-9dd59fc9-3ed9-40ca-8f39-e68bd2724c14-00040.parquet | 80 | 80
|- hdfsloc org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1008511888 | 40 | 5,576
|  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1009e20278 | 8 | 512
|  |- blocks java.util.ArrayList @ 0x10085118b0 | 24 | 2,760
|  |  |- <class> class java.util.ArrayList @ 0x100573da10 System Class | 32 | 240
|  |  |- elementData java.lang.Object[1] @ 0x10085118c8 | 24 | 2,736
|  |  |  |- class java.lang.Object[] @ 0x1005fc4650 | 0 | 0
|  |  |  |- [0] org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x10085118e0 | 48 | 2,712
|  |  |  |  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
|  |  |  |  |- storageIDs java.lang.String[3] @ 0x10085117f8 | 32 | 32
|  |  |  |  |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008511818 | 32 | 32
|  |  |  |  |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x1008511910 | 24 | 64
|  |  |  |  |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x1008511950 | 32 | 2,456
|  |  |  |  |  |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
|  |  |  |  |  |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511970 | 200 | 808
|  |  |  |  |  |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511c98 | 200 | 808
|  |  |  |  |  |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511fc0 | 200 | 808
|  |  |  |  |  '- Total: 4 entries | |
|  |  |  |  |- blockToken org.apache.hadoop.security.token.Token @ 0x10085122e8 | 32 | 144
|  |  |  |  |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
|  |  |  |  '- Total: 7 entries | |
|  |  |  '- Total: 2 entries | |
|  |  '- Total: 2 entries | |
|  |- lastLocatedBlock org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1008512378 | 48 | 2,776
|  |  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 | 16 | 424
|  |  |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x10085123a8 | 24 | 64
|  |  |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x10085123e8 | 32 | 2,216
|  |  |  |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 | 0 | 0
|  |  |  |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512408 | 200 | 728
|  |  |  |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512730 | 200 | 728
|  |  |  |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512a58 | 200 | 728
|  |  |  |  |- <class> class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x102005aee8 | 8 | 104
|  |  |  |  |- ipAddr java.lang.String @ 0x1008512b20  xxx.xxx.xxx.xxx | 24 | 56
|  |  |  |  |- ipAddrBytes com.google.protobuf.LiteralByteString @ 0x1008512b58 | 24 | 56
|  |  |  |  |- hostName java.lang.String @ 0x1008512b90  www.abc.com | 24 | 56
|  |  |  |  |- hostNameBytes com.google.protobuf.LiteralByteString @ 0x1008512bc8 | 24 | 56
|  |  |  |  |- xferAddr java.lang.String @ 0x1008512c00  xxx.xxx.xxx.xxx:9866 | 24 | 64
|  |  |  |  |- datanodeUuid java.lang.String @ 0x1008512c40  2f6e6e42-9347-4370-a318-79efdadcc3cf | 24 | 80
|  |  |  |  |- datanodeUuidBytes com.google.protobuf.LiteralByteString @ 0x1008512c90 | 24 | 80
|  |  |  |  |- location java.lang.String @ 0x1008512ce0  /default | 24 | 48
|  |  |  |  |- dependentHostNames java.util.LinkedList @ 0x1008512d10 | 32 | 32
|  |  |  |  |- storageID java.lang.String @ 0x1008512d30  DS-f190d2ef-755b-4f73-bb3d-67b6e72805e2 | 24 | 80
|  |  |  |  |- adminState org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates @ 0x101b01ef50  NORMAL | 24 | 24
|  |  |  |  |- storageType org.apache.hadoop.fs.StorageType @ 0x101b01f000  DISK | 24 | 24
|  |  |  |  '- Total: 13 entries | |
|  |  |  '- Total: 4 entries | |
|  |  |- storageIDs java.lang.String[3] @ 0x1008512d80 | 32 | 32
|  |  |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008512da0 | 32 | 32
|  |  |- blockToken org.apache.hadoop.security.token.Token @ 0x1008512dc0 | 32 | 144
|  |  |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 | 16 | 16
|  |  '- Total: 7 entries | |
|  '- Total: 3 entries | |
'- Total: 10 entries | |
{noformat}

There are some duplicate strings, such as storageIDs and hostnames. We could 
invoke String.intern() on them to save some space, but it would be better to 
convert these FileStatus objects into IcebergFileDescriptor objects as soon as 
they are loaded, which reduces the space usage. Encoding each 
IcebergFileDescriptor into bytes right away (usually about 200 bytes per file) 
would save even more space.
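
To illustrate the idea, here is a minimal, single-threaded Java sketch of encoding each listed file as soon as it arrives. It is not Impala's actual IcebergFileDescriptor encoding: EagerEncodingSketch, ListedFile, HostIndex, and the byte layout are all hypothetical stand-ins that use only JDK classes. Each file is reduced to a relative path, a length, a modification time, and indexes into a shared host table, so every distinct hostname is stored only once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Impala or Hadoop.
final class EagerEncodingSketch {

  // Stand-in for a heavyweight listing result such as HdfsLocatedFileStatus.
  static final class ListedFile {
    final String path;
    final long length;
    final long modificationTime;
    final List<String> hosts;

    ListedFile(String path, long length, long modificationTime, List<String> hosts) {
      this.path = path;
      this.length = length;
      this.modificationTime = modificationTime;
      this.hosts = hosts;
    }
  }

  // Stores each distinct host name once; files refer to hosts by index.
  // Not thread-safe: a parallel listing would need synchronization here.
  static final class HostIndex {
    private final Map<String, Integer> indexByHost = new HashMap<>();
    private final List<String> hosts = new ArrayList<>();

    int indexOf(String host) {
      Integer idx = indexByHost.get(host);
      if (idx == null) {
        idx = hosts.size();
        hosts.add(host);
        indexByHost.put(host, idx);
      }
      return idx;
    }

    String hostAt(int idx) { return hosts.get(idx); }
    int size() { return hosts.size(); }
  }

  // Encodes one file right after it is listed, keeping only the needed fields.
  static byte[] encode(ListedFile f, String tableLocation, HostIndex hostIndex) {
    boolean relative = f.path.startsWith(tableLocation);
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeBoolean(relative);
      out.writeUTF(relative ? f.path.substring(tableLocation.length()) : f.path);
      out.writeLong(f.length);
      out.writeLong(f.modificationTime);
      out.writeShort(f.hosts.size());
      for (String host : f.hosts) out.writeInt(hostIndex.indexOf(host));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.toByteArray();
  }

  // Rebuilds the lightweight view from the bytes and the shared host table.
  static ListedFile decode(byte[] encoded, String tableLocation, HostIndex hostIndex) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
      boolean relative = in.readBoolean();
      String path = in.readUTF();
      if (relative) path = tableLocation + path;
      long length = in.readLong();
      long modificationTime = in.readLong();
      int numHosts = in.readUnsignedShort();
      List<String> hosts = new ArrayList<>(numHosts);
      for (int i = 0; i < numHosts; i++) hosts.add(hostIndex.hostAt(in.readInt()));
      return new ListedFile(path, length, modificationTime, hosts);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String location = "hdfs://ns/warehouse/t/";  // made-up table location
    HostIndex hostIndex = new HostIndex();
    ListedFile f = new ListedFile(location + "data/a.parquet", 1024L, 42L,
        Arrays.asList("dn1", "dn2", "dn3"));
    byte[] encoded = encode(f, location, hostIndex);
    System.out.println("encoded bytes: " + encoded.length);
  }
}
```

In this sketch the map from paths to listing results would hold the small byte arrays instead of the full objects, so the heavyweight block-location graph becomes garbage as soon as each file has been encoded.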


