WangGuangxin opened a new issue, #5320:
URL: https://github.com/apache/incubator-gluten/issues/5320
### Description
Currently, the driver generates `GlutenPartition`s from Spark's
`FilePartition`s, converting each one to a `LocalFilesNode` and serializing it
to a byte array in protobuf format.
This roughly doubles driver memory, because the `FilePartition`s are not
released after the conversion to `LocalFilesNode`s.
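The flow above can be sketched roughly as follows. This is a simplified illustration, not the actual Gluten code: `PartitionedFile`, `FilePartition`, `GlutenPartition`, and the "serialization" step are all stand-ins for the real Spark/Gluten classes and the protobuf encoding of `LocalFilesNode`.

```scala
// Simplified stand-ins for the real Spark/Gluten types.
case class PartitionedFile(path: String, start: Long, length: Long)
case class FilePartition(index: Int, files: Seq[PartitionedFile])
case class GlutenPartition(index: Int, substraitPlan: Array[Byte])

// Placeholder for "convert FilePartition -> LocalFilesNode -> protobuf bytes".
def toLocalFilesNodeBytes(p: FilePartition): Array[Byte] =
  p.files.map(f => s"${f.path}:${f.start}:${f.length}").mkString("\n").getBytes("UTF-8")

// The driver materializes the serialized byte arrays while the
// FilePartitions are still referenced, so both copies of the per-file
// metadata are alive on the driver at the same time.
def planInputPartitions(parts: Seq[FilePartition]): Seq[GlutenPartition] =
  parts.map(p => GlutenPartition(p.index, toLocalFilesNodeBytes(p)))
```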
When there are many file splits (file statuses), the impact is significant.
For example, in one of our cases there are 48 HDFS paths to list, with
7,039,474 files under them. Vanilla Spark can run this with 20 GB of driver
memory, but Gluten fails.
From the GC logs, we can see that Gluten holds far more `String` and `byte[]`
objects than vanilla Spark.
Vanilla Spark Full GC objects
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      42535479     8856286272  [C
   2:      42538104     1020914496  java.lang.String
   3:       7044015      563521200  java.net.URI
   4:       7039474      506842128  org.apache.hadoop.fs.LocatedFileStatus
   5:         13412      332304008  [B
   6:       7039474      281578960  org.apache.spark.sql.execution.datasources.PartitionedFile
   7:       7040016      225280512  scala.collection.mutable.LinkedHashSet$Entry
   8:       7039542      225265344  scala.collection.mutable.LinkedEntry
   9:       7039479      225263328  org.apache.hadoop.fs.permission.FsPermission
  10:          1412      151374272  [Lscala.collection.mutable.HashEntry;
  11:           145      125501688  [Lorg.apache.hadoop.fs.FileStatus;
  12:       7039625      112634000  org.apache.hadoop.fs.Path
  13:         55673       42854960  [Ljava.lang.Object;
  14:        146968       30759312  [Lorg.apache.spark.sql.execution.datasources.PartitionedFile;
  15:          2462       27069520  [J
  16:       1004712       24113088  java.util.concurrent.ConcurrentSkipListMap$Node
  17:        146968       16460416  org.apache.spark.scheduler.ResultTask
  18:        791929       12670864  scala.Some
```
Gluten Full GC objects
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      70600217     9596405088  [C
   2:        153749     2117256784  [B
   3:      70603033     1694472792  java.lang.String
   4:      28210146      902724672  java.util.HashMap$Node
   5:       7056556      564282560  [Ljava.util.HashMap$Node;
   6:       7044001      563520080  java.net.URI
   7:       7039474      506842128  org.apache.hadoop.fs.LocatedFileStatus
   8:       7054771      338629008  java.util.HashMap
   9:       7039496      225263872  scala.collection.mutable.LinkedEntry
  10:       7039479      225263328  org.apache.hadoop.fs.permission.FsPermission
  11:       7040463      168971112  java.lang.Long
  12:        777126      135040840  [Ljava.lang.Object;
  13:       7039578      112633248  org.apache.hadoop.fs.Path
  14:          1332       67224064  [Lscala.collection.mutable.HashEntry;
  15:            97       56405176  [Lorg.apache.hadoop.fs.FileStatus;
  16:        748173       17956152  java.util.ArrayList
  17:        593611       14246664  scala.collection.immutable.$colon$colon
  18:          1919        9036728  [J
```
So I propose reusing the `FilePartition` in the driver and postponing the
conversion to `LocalFilesNode` until the executor.
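One possible shape for this, sketched with illustrative names rather than the actual Gluten API: ship Spark's `FilePartition` to the executor unchanged and build the `LocalFilesNode`/protobuf bytes lazily there, so the driver never holds both representations at once.

```scala
// Simplified stand-ins for the real Spark types (not the Gluten API).
case class PartitionedFile(path: String, start: Long, length: Long)
case class FilePartition(index: Int, files: Seq[PartitionedFile])

// A partition that carries only the FilePartition across the wire.
class LazyGlutenPartition(val filePartition: FilePartition) extends Serializable {
  // Built on first access; with @transient it is excluded from
  // serialization and recomputed after deserialization, i.e. on the
  // executor rather than the driver. The body is a placeholder for the
  // real LocalFilesNode -> protobuf conversion.
  @transient lazy val substraitPlan: Array[Byte] =
    filePartition.files
      .map(f => s"${f.path}:${f.start}:${f.length}")
      .mkString("\n")
      .getBytes("UTF-8")
}
```

With this pattern the driver only pays for the `FilePartition` objects it already holds, and each executor pays for exactly one serialized plan per task.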
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]