[ 
https://issues.apache.org/jira/browse/IMPALA-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy reassigned IMPALA-14349:
------------------------------------------

    Assignee: Zoltán Borók-Nagy

> Encode FileDescriptors in time in loading Iceberg Tables
> --------------------------------------------------------
>
>                 Key: IMPALA-14349
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14349
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: iceberg
>
> When loading file metadata of an IcebergTable in 
> IcebergFileMetadataLoader#loadInternal() -> parallelListing(), we maintain a 
> map from paths to FileStatus objects:
> [https://github.com/apache/impala/blob/50926b5d8e941c5cc10fd77d0b4556e3441c41e7/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java#L171]
> This map consumes a lot of memory because the loaded FileStatus objects are 
> of type HdfsLocatedFileStatus, and each of them retains about 6 KB of heap. 
> For example:
> {noformat}
> Class Name | Shallow Heap | Retained Heap
> ----------------------------------------------------------------------------------------------------------------------------------------
> org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1008511620 |          120 |         6,192
> |- <class> class org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus @ 0x1009e2a058 |           16 |            40
> |- isdir java.lang.Boolean @ 0x10056a7638  false |           16 |            16
> |- path org.apache.hadoop.fs.Path @ 0x1008511310 |           16 |           784
> |- permission org.apache.hadoop.hdfs.protocol.FsPermissionExtension @ 0x1008511698 |           32 |            32
> |- owner java.lang.String @ 0x10085116b8  id971832 |           24 |            48
> |- group java.lang.String @ 0x10085116e8  hive |           24 |            48
> |- attr java.util.RegularEnumSet @ 0x1008511718 |           32 |            32
> |- locations org.apache.hadoop.fs.BlockLocation[1] @ 0x1008511738 |           24 |           192
> |- uPath byte[62] @ 0x1008511838  00668-28396-9dd59fc9-3ed9-40ca-8f39-e68bd2724c14-00040.parquet |           80 |            80
> |- hdfsloc org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1008511888 |           40 |         5,576
> |  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlocks @ 0x1009e20278 |            8 |           512
> |  |- blocks java.util.ArrayList @ 0x10085118b0 |           24 |         2,760
> |  |  |- <class> class java.util.ArrayList @ 0x100573da10 System Class |           32 |           240
> |  |  |- elementData java.lang.Object[1] @ 0x10085118c8 |           24 |         2,736
> |  |  |  |- class java.lang.Object[] @ 0x1005fc4650 |            0 |             0
> |  |  |  |- [0] org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x10085118e0 |           48 |         2,712
> |  |  |  |  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 |           16 |           424
> |  |  |  |  |- storageIDs java.lang.String[3] @ 0x10085117f8 |           32 |            32
> |  |  |  |  |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008511818 |           32 |            32
> |  |  |  |  |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x1008511910 |           24 |            64
> |  |  |  |  |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x1008511950 |           32 |         2,456
> |  |  |  |  |  |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 |            0 |             0
> |  |  |  |  |  |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511970 |          200 |           808
> |  |  |  |  |  |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511c98 |          200 |           808
> |  |  |  |  |  |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008511fc0 |          200 |           808
> |  |  |  |  |  '- Total: 4 entries |              |
> |  |  |  |  |- blockToken org.apache.hadoop.security.token.Token @ 0x10085122e8 |           32 |           144
> |  |  |  |  |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 |           16 |            16
> |  |  |  |  '- Total: 7 entries |              |
> |  |  |  '- Total: 2 entries |              |
> |  |  '- Total: 2 entries |              |
> |  |- lastLocatedBlock org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1008512378 |           48 |         2,776
> |  |  |- <class> class org.apache.hadoop.hdfs.protocol.LocatedBlock @ 0x1009e26700 |           16 |           424
> |  |  |- b org.apache.hadoop.hdfs.protocol.ExtendedBlock @ 0x10085123a8 |           24 |            64
> |  |  |- locs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[3] @ 0x10085123e8 |           32 |         2,216
> |  |  |  |- class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] @ 0x102005b000 |            0 |             0
> |  |  |  |- [2] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512408 |          200 |           728
> |  |  |  |- [1] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512730 |          200 |           728
> |  |  |  |- [0] org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x1008512a58 |          200 |           728
> |  |  |  |  |- <class> class org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage @ 0x102005aee8 |            8 |           104
> |  |  |  |  |- ipAddr java.lang.String @ 0x1008512b20  xxx.xxx.xxx.xxx |           24 |            56
> |  |  |  |  |- ipAddrBytes com.google.protobuf.LiteralByteString @ 0x1008512b58 |           24 |            56
> |  |  |  |  |- hostName java.lang.String @ 0x1008512b90  www.abc.com |           24 |            56
> |  |  |  |  |- hostNameBytes com.google.protobuf.LiteralByteString @ 0x1008512bc8 |           24 |            56
> |  |  |  |  |- xferAddr java.lang.String @ 0x1008512c00  xxx.xxx.xxx.xxx:9866 |           24 |            64
> |  |  |  |  |- datanodeUuid java.lang.String @ 0x1008512c40  2f6e6e42-9347-4370-a318-79efdadcc3cf |           24 |            80
> |  |  |  |  |- datanodeUuidBytes com.google.protobuf.LiteralByteString @ 0x1008512c90 |           24 |            80
> |  |  |  |  |- location java.lang.String @ 0x1008512ce0  /default |           24 |            48
> |  |  |  |  |- dependentHostNames java.util.LinkedList @ 0x1008512d10 |           32 |            32
> |  |  |  |  |- storageID java.lang.String @ 0x1008512d30  DS-f190d2ef-755b-4f73-bb3d-67b6e72805e2 |           24 |            80
> |  |  |  |  |- adminState org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates @ 0x101b01ef50  NORMAL |           24 |            24
> |  |  |  |  |- storageType org.apache.hadoop.fs.StorageType @ 0x101b01f000  DISK |           24 |            24
> |  |  |  |  '- Total: 13 entries |              |
> |  |  |  '- Total: 4 entries |              |
> |  |  |- storageIDs java.lang.String[3] @ 0x1008512d80 |           32 |            32
> |  |  |- storageTypes org.apache.hadoop.fs.StorageType[3] @ 0x1008512da0 |           32 |            32
> |  |  |- blockToken org.apache.hadoop.security.token.Token @ 0x1008512dc0 |           32 |           144
> |  |  |- cachedLocs org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[0] @ 0x101b01f328 |           16 |            16
> |  |  '- Total: 7 entries |              |
> |  '- Total: 3 entries |              |
> '- Total: 10 entries
> {noformat}
> Some strings, such as storageIDs and hostnames, are duplicated across these 
> objects. We could invoke String.intern() on them to save some space, but it 
> would be better to convert the FileStatus objects into IcebergFileDescriptor 
> objects as soon as they are loaded, which reduces the space usage. Encoding 
> each IcebergFileDescriptor into bytes right away (usually about 200 bytes 
> per file) would save even more space.
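
The proposal above can be illustrated with a minimal, self-contained sketch. It is not Impala's code or its actual descriptor encoding: ListedFile, HostIndex, encode() and encodeAll() are hypothetical names invented for this illustration, and the byte layout is a simplified stand-in. The idea it shows is that each listed file is turned into a small byte array as soon as it is seen, and that repeated host names are stored once in a shared table (playing the role that String.intern() would play for duplicate strings), so the heavyweight listing objects never accumulate in a map.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EagerEncodingSketch {
  /** Hypothetical stand-in for a heavyweight listing result such as HdfsLocatedFileStatus. */
  static final class ListedFile {
    final String path;
    final long length;
    final long modificationTime;
    final List<String> replicaHosts;

    ListedFile(String path, long length, long modificationTime, List<String> replicaHosts) {
      this.path = path;
      this.length = length;
      this.modificationTime = modificationTime;
      this.replicaHosts = replicaHosts;
    }
  }

  /** Shared host table: each distinct host name is stored once and referenced by index. */
  static final class HostIndex {
    private final Map<String, Integer> indexByHost = new HashMap<>();
    private final List<String> hosts = new ArrayList<>();

    int indexOf(String host) {
      return indexByHost.computeIfAbsent(host, h -> {
        hosts.add(h);
        return hosts.size() - 1;
      });
    }

    int size() { return hosts.size(); }
  }

  /** Encodes one file into a compact byte array; the heavy object can be dropped afterwards. */
  static byte[] encode(ListedFile file, HostIndex hostIndex) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeUTF(file.path);
      out.writeLong(file.length);
      out.writeLong(file.modificationTime);
      out.writeShort(file.replicaHosts.size());
      for (String host : file.replicaHosts) {
        out.writeShort(hostIndex.indexOf(host));  // 2-byte index instead of a full string
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.toByteArray();
  }

  /** Converts each listed file as it arrives instead of first collecting all heavy objects. */
  static Map<String, byte[]> encodeAll(Iterable<ListedFile> listing, HostIndex hostIndex) {
    Map<String, byte[]> encodedByPath = new HashMap<>();
    for (ListedFile file : listing) {
      encodedByPath.put(file.path, encode(file, hostIndex));
    }
    return encodedByPath;
  }

  public static void main(String[] args) {
    HostIndex hostIndex = new HostIndex();
    List<ListedFile> listing = Arrays.asList(
        new ListedFile("data/00001.parquet", 1024L, 1700000000000L,
            Arrays.asList("dn1", "dn2", "dn3")),
        new ListedFile("data/00002.parquet", 2048L, 1700000001000L,
            Arrays.asList("dn2", "dn3", "dn1")));
    Map<String, byte[]> encoded = encodeAll(listing, hostIndex);
    encoded.forEach((path, bytes) ->
        System.out.println(path + " -> " + bytes.length + " bytes"));
    System.out.println("distinct hosts: " + hostIndex.size());
  }
}
```

In this sketch the map holds only byte arrays keyed by path, so its footprint grows with the encoded size per file rather than with the retained size of the listing objects. A real change would have to keep whatever fields the catalog actually needs from each FileStatus.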



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
