WangGuangxin opened a new pull request, #5321:
URL: https://github.com/apache/incubator-gluten/pull/5321
## What changes were proposed in this pull request?
Currently, the driver generates `GlutenPartition`s from Spark's
`FilePartition`s, then converts them to `LocalFilesNode`s and serializes them
to byte arrays in protobuf format.
This roughly doubles driver memory, because the `FilePartition`s are not
released after the conversion to `LocalFilesNode`s.
When there are many file splits (file statuses), the impact is significant.
For example, in one of our cases there are 48 HDFS paths to list, with
7,039,474 files under them. Vanilla Spark handles this workload with 20 GB of
driver memory, but Gluten fails.
From the GC log, we can see that Gluten holds many more `String` and `byte[]`
objects than vanilla Spark.
Vanilla Spark Full GC objects
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      42535479     8856286272  [C
   2:      42538104     1020914496  java.lang.String
   3:       7044015      563521200  java.net.URI
   4:       7039474      506842128  org.apache.hadoop.fs.LocatedFileStatus
   5:         13412      332304008  [B
   6:       7039474      281578960  org.apache.spark.sql.execution.datasources.PartitionedFile
   7:       7040016      225280512  scala.collection.mutable.LinkedHashSet$Entry
   8:       7039542      225265344  scala.collection.mutable.LinkedEntry
   9:       7039479      225263328  org.apache.hadoop.fs.permission.FsPermission
  10:          1412      151374272  [Lscala.collection.mutable.HashEntry;
  11:           145      125501688  [Lorg.apache.hadoop.fs.FileStatus;
  12:       7039625      112634000  org.apache.hadoop.fs.Path
  13:         55673       42854960  [Ljava.lang.Object;
  14:        146968       30759312  [Lorg.apache.spark.sql.execution.datasources.PartitionedFile;
  15:          2462       27069520  [J
  16:       1004712       24113088  java.util.concurrent.ConcurrentSkipListMap$Node
  17:        146968       16460416  org.apache.spark.scheduler.ResultTask
  18:        791929       12670864  scala.Some
```
Gluten Full GC objects (before this patch)
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      70600217     9596405088  [C
   2:        153749     2117256784  [B
   3:      70603033     1694472792  java.lang.String
   4:      28210146      902724672  java.util.HashMap$Node
   5:       7056556      564282560  [Ljava.util.HashMap$Node;
   6:       7044001      563520080  java.net.URI
   7:       7039474      506842128  org.apache.hadoop.fs.LocatedFileStatus
   8:       7054771      338629008  java.util.HashMap
   9:       7039496      225263872  scala.collection.mutable.LinkedEntry
  10:       7039479      225263328  org.apache.hadoop.fs.permission.FsPermission
  11:       7040463      168971112  java.lang.Long
  12:        777126      135040840  [Ljava.lang.Object;
  13:       7039578      112633248  org.apache.hadoop.fs.Path
  14:          1332       67224064  [Lscala.collection.mutable.HashEntry;
  15:            97       56405176  [Lorg.apache.hadoop.fs.FileStatus;
  16:        748173       17956152  java.util.ArrayList
  17:        593611       14246664  scala.collection.immutable.$colon$colon
  18:          1919        9036728  [J
```
Gluten Full GC objects (after this patch)
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      50009922    11752807376  [C
   2:      49812651     1195503624  java.lang.String
   3:       7043968      563517440  java.net.URI
   4:       7039474      506842128  org.apache.hadoop.fs.LocatedFileStatus
   5:       7039474      394210544  org.apache.spark.util.HadoopFSUtils$SerializableFileStatus
   6:         26766      259720056  [B
   7:       7039479      225263328  org.apache.hadoop.fs.permission.FsPermission
   8:       7039572      112633152  org.apache.hadoop.fs.Path
   9:         45775       68452656  [Ljava.lang.Object;
  10:       1573313       50346016  scala.collection.mutable.LinkedHashSet$Entry
  11:          1304       33665792  [Lscala.collection.mutable.HashEntry;
  12:         14435       15252040  [I
  13:            13        6756208  [Lorg.apache.hadoop.fs.FileStatus;
  14:        167935        5373920  java.util.concurrent.ConcurrentHashMap$Node
  15:        122916        3933312  java.util.Hashtable$Entry
  16:         31958        3531872  java.lang.Class
  17:         97118        3107776  scala.collection.mutable.ArrayBuilder$ofRef
  18:         97117        3107744  java.net.URI$Parser
```
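For illustration, the retention pattern behind the memory doubling can be sketched as follows. This is a hypothetical stand-in, not Gluten's actual `GlutenPartition`/`LocalFilesNode` code: `FileSplit` and the two `convert*` methods are invented names, and string concatenation stands in for protobuf serialization. The point is the difference between keeping the source objects alive alongside their serialized copies versus releasing each source as soon as it is serialized.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PartitionSerializationSketch {

    // Hypothetical stand-in for one file split held by the driver.
    record FileSplit(String path, long start, long length) {
        byte[] toBytes() {
            // Stand-in for protobuf serialization of a LocalFilesNode.
            return (path + ":" + start + ":" + length).getBytes(StandardCharsets.UTF_8);
        }
    }

    // Before: serialize every split while the caller still references the
    // source list, so both representations are retained at the same time.
    static List<byte[]> convertRetainingSources(List<FileSplit> splits) {
        List<byte[]> serialized = new ArrayList<>(splits.size());
        for (FileSplit s : splits) {
            serialized.add(s.toBytes());
        }
        return serialized; // `splits` remains fully populated and reachable
    }

    // After: consume the source list in place, dropping each split as soon
    // as its serialized form exists, so only one copy stays reachable.
    static List<byte[]> convertReleasingSources(List<FileSplit> splits) {
        List<byte[]> serialized = new ArrayList<>(splits.size());
        for (int i = 0; i < splits.size(); i++) {
            serialized.add(splits.get(i).toBytes());
            splits.set(i, null); // let the GC reclaim the source object
        }
        splits.clear();
        return serialized;
    }

    public static void main(String[] args) {
        List<FileSplit> splits = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            splits.add(new FileSplit("hdfs://path/file" + i, 0, 128));
        }
        List<byte[]> out = convertReleasingSources(splits);
        System.out.println(out.size() + " serialized, " + splits.size() + " sources retained");
    }
}
```

Running `main` prints `3 serialized, 0 sources retained`. With millions of splits, the difference between the two variants is one full extra copy of every split object held until the job finishes.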
(Fixes: #5320)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]