WangGuangxin opened a new pull request, #5321:
URL: https://github.com/apache/incubator-gluten/pull/5321
## What changes were proposed in this pull request?
Currently, the driver generates `GlutenPartition`s from Spark's
`FilePartition`s, then converts them to `LocalFilesNode`s and serializes them
to byte arrays in protobuf format.
This roughly doubles driver memory, because the `FilePartition`s are not
released after the conversion to `LocalFilesNode`s.
When there are many file splits (file statuses), the impact is significant.
For example, in one of our cases there are 48 HDFS paths to list, with
7,039,474 files under them. Vanilla Spark handles this workload with 20 GB of
driver memory, but Gluten fails.
From the GC log, we can see that Gluten holds many more `String` and `byte[]`
objects than vanilla Spark.
Vanilla Spark Full GC objects
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      42535479     8856286272  [C
   2:      42538104     1020914496  java.lang.String
   3:       7044015      563521200  java.net.URI
   4:       7039474      506842128  org.apache.hadoop.fs.LocatedFileStatus
   5:         13412      332304008  [B
   6:       7039474      281578960  org.apache.spark.sql.execution.datasources.PartitionedFile
   7:       7040016      225280512  scala.collection.mutable.LinkedHashSet$Entry
   8:       7039542      225265344  scala.collection.mutable.LinkedEntry
   9:       7039479      225263328  org.apache.hadoop.fs.permission.FsPermission
  10:          1412      151374272  [Lscala.collection.mutable.HashEntry;
  11:           145      125501688  [Lorg.apache.hadoop.fs.FileStatus;
  12:       7039625      112634000  org.apache.hadoop.fs.Path
  13:         55673       42854960  [Ljava.lang.Object;
  14:        146968       30759312  [Lorg.apache.spark.sql.execution.datasources.PartitionedFile;
  15:          2462       27069520  [J
  16:       1004712       24113088  java.util.concurrent.ConcurrentSkipListMap$Node
  17:        146968       16460416  org.apache.spark.scheduler.ResultTask
  18:        791929       12670864  scala.Some
```
Gluten Full GC objects (before this patch)
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      70600217     9596405088  [C
   2:        153749     2117256784  [B
   3:      70603033     1694472792  java.lang.String
   4:      28210146      902724672  java.util.HashMap$Node
   5:       7056556      564282560  [Ljava.util.HashMap$Node;
   6:       7044001      563520080  java.net.URI
   7:       7039474      506842128  org.apache.hadoop.fs.LocatedFileStatus
   8:       7054771      338629008  java.util.HashMap
   9:       7039496      225263872  scala.collection.mutable.LinkedEntry
  10:       7039479      225263328  org.apache.hadoop.fs.permission.FsPermission
  11:       7040463      168971112  java.lang.Long
  12:        777126      135040840  [Ljava.lang.Object;
  13:       7039578      112633248  org.apache.hadoop.fs.Path
  14:          1332       67224064  [Lscala.collection.mutable.HashEntry;
  15:            97       56405176  [Lorg.apache.hadoop.fs.FileStatus;
  16:        748173       17956152  java.util.ArrayList
  17:        593611       14246664  scala.collection.immutable.$colon$colon
  18:          1919        9036728  [J
```
Gluten Full GC objects (after this patch)
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      50009922    11752807376  [C
   2:      49812651     1195503624  java.lang.String
   3:       7043968      563517440  java.net.URI
   4:       7039474      506842128  org.apache.hadoop.fs.LocatedFileStatus
   5:       7039474      394210544  org.apache.spark.util.HadoopFSUtils$SerializableFileStatus
   6:         26766      259720056  [B
   7:       7039479      225263328  org.apache.hadoop.fs.permission.FsPermission
   8:       7039572      112633152  org.apache.hadoop.fs.Path
   9:         45775       68452656  [Ljava.lang.Object;
  10:       1573313       50346016  scala.collection.mutable.LinkedHashSet$Entry
  11:          1304       33665792  [Lscala.collection.mutable.HashEntry;
  12:         14435       15252040  [I
  13:            13        6756208  [Lorg.apache.hadoop.fs.FileStatus;
  14:        167935        5373920  java.util.concurrent.ConcurrentHashMap$Node
  15:        122916        3933312  java.util.Hashtable$Entry
  16:         31958        3531872  java.lang.Class
  17:         97118        3107776  scala.collection.mutable.ArrayBuilder$ofRef
  18:         97117        3107744  java.net.URI$Parser
```
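For illustration, the retention pattern behind the memory doubling can be sketched as follows. This is a hypothetical stand-in, not Gluten's actual `GlutenPartition`/`LocalFilesNode` code: `FileSplit` and the two `convert*` methods are invented names, and string concatenation stands in for protobuf serialization. The point is the difference between keeping the source objects alive alongside their serialized copies versus releasing each source as soon as it is serialized.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PartitionSerializationSketch {

    // Hypothetical stand-in for one file split held by the driver.
    record FileSplit(String path, long start, long length) {
        byte[] toBytes() {
            // Stand-in for protobuf serialization of a LocalFilesNode.
            return (path + ":" + start + ":" + length).getBytes(StandardCharsets.UTF_8);
        }
    }

    // Before: serialize every split while the caller still references the
    // source list, so both representations are retained at the same time.
    static List<byte[]> convertRetainingSources(List<FileSplit> splits) {
        List<byte[]> serialized = new ArrayList<>(splits.size());
        for (FileSplit s : splits) {
            serialized.add(s.toBytes());
        }
        return serialized; // `splits` remains fully populated and reachable
    }

    // After: consume the source list in place, dropping each split as soon
    // as its serialized form exists, so only one copy stays reachable.
    static List<byte[]> convertReleasingSources(List<FileSplit> splits) {
        List<byte[]> serialized = new ArrayList<>(splits.size());
        for (int i = 0; i < splits.size(); i++) {
            serialized.add(splits.get(i).toBytes());
            splits.set(i, null); // let the GC reclaim the source object
        }
        splits.clear();
        return serialized;
    }

    public static void main(String[] args) {
        List<FileSplit> splits = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            splits.add(new FileSplit("hdfs://path/file" + i, 0, 128));
        }
        List<byte[]> out = convertReleasingSources(splits);
        System.out.println(out.size() + " serialized, " + splits.size() + " sources retained");
    }
}
```

Running `main` prints `3 serialized, 0 sources retained`. With millions of splits, the difference between the two variants is one full extra copy of every split object held until the job finishes.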
(Fixes: #5320)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]