Min Shen created SPARK-33781:
--------------------------------
Summary: Improve caching of MergeStatus on the executor side to
save memory
Key: SPARK-33781
URL: https://issues.apache.org/jira/browse/SPARK-33781
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 3.1.0
Reporter: Min Shen
MapOutputTrackerWorker caches in memory the MapStatus or MergeStatus array
that it retrieves from the driver for a given shuffle, so that all tasks
fetching data for that shuffle can reuse the cached metadata.
However, unlike the MapStatus array, where each task needs to access every
entry, a task needs only one or a few MergeStatus objects from the MergeStatus
array, depending on which shuffle partitions it is processing.
For large shuffles with tens or hundreds of thousands of shuffle partitions,
caching the entire deserialized and decompressed MergeStatus array on the
executor side, when perhaps only 0.1% of its entries will be used by the tasks
running on that executor, is a huge waste of memory.
We could improve this by caching the serialized and compressed bytes of the
MergeStatus array instead, and keeping in deserialized form only the
MergeStatus objects that the executor actually needs. In addition to saving
memory, this also reduces GC pressure on the executor side.
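
The idea can be illustrated with a minimal sketch. It is written in Python
for brevity and is not Spark code: MergeStatusStub, CompactMergeStatusCache
and their fields are hypothetical names, and json plus zlib stand in for
whatever serializer and compression codec the executor would really use.

```python
import json
import zlib
from typing import Dict, List, NamedTuple


class MergeStatusStub(NamedTuple):
    """Hypothetical stand-in for a MergeStatus entry (not the Spark class)."""
    partition_id: int
    location: str
    total_size: int


class CompactMergeStatusCache:
    """Holds the full array only as compressed bytes.

    Deserialized entries are kept per partition, and only for the
    partitions that have actually been requested.
    """

    def __init__(self, compressed: bytes) -> None:
        self._compressed = compressed
        self._needed: Dict[int, MergeStatusStub] = {}

    @classmethod
    def from_statuses(cls, statuses: List[MergeStatusStub]) -> "CompactMergeStatusCache":
        # Assumption: json + zlib replace the real serializer and codec.
        payload = json.dumps([list(s) for s in statuses]).encode("utf-8")
        return cls(zlib.compress(payload))

    def get(self, partition_id: int) -> MergeStatusStub:
        cached = self._needed.get(partition_id)
        if cached is not None:
            return cached
        # Inflate the whole array transiently; it becomes garbage as soon
        # as this method returns, and only the requested entry is retained.
        rows = json.loads(zlib.decompress(self._compressed).decode("utf-8"))
        status = MergeStatusStub(*rows[partition_id])
        self._needed[partition_id] = status
        return status

    def cached_partitions(self) -> List[int]:
        return sorted(self._needed)


if __name__ == "__main__":
    statuses = [MergeStatusStub(i, f"host-{i % 4}", 1000 + i) for i in range(1000)]
    cache = CompactMergeStatusCache.from_statuses(statuses)
    print(cache.get(42))
    print(cache.cached_partitions())
```

One trade-off in this sketch is that every cache miss decompresses and parses
the whole array again. A real implementation might resolve all partitions a
task needs in a single pass to limit that cost.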