kevinwilfong opened a new pull request, #10662: URL: https://github.com/apache/incubator-gluten/pull/10662
## What changes are proposed in this pull request? Today the GlutenPartition objects contain an array of byte arrays which are the Protobuf serialized ReadRel.read_type objects from the SplitInfos. The GlutenPartitions are Java serialized and sent to the Executors responsible for their respective Tasks. Looking through the code it appears we Protobuf serialize the SplitInfos so we can easily pass them across the JNI boundary. We see the serialized SplitInfos can consume a significant amount of memory in the Driver, this is because as SplitInfo objects their state can share references tot he same objects, but once serialized they share nothing, which explodes their size in memory. If we Java serialize the SplitInfo objects like the rest of the GlutenPartition state, and Protobuf serialize them as part of the Task, this can significantly save driver memory. The cost is a little additional memory in the Task, the size of the SplitInfo objects for a single GlutenPartition which should be trivial, and a little additional CPU instead of Protobuf serializing in the Driver and Java serializing the array of byte arrays, we Java serialize the array of SplitInfos, and on the Task we pay the additional cost of Java deserializing an array of SplitInfos and Protobuf serializing them, overall the difference is just the additional cost of Java serializing the SplitInfos instead of byte arrays. ## How was this patch tested? Ran the existing unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
