Min Shen created SPARK-33781:
--------------------------------

             Summary: Improve caching of MergeStatus on the executor side to 
save memory
                 Key: SPARK-33781
                 URL: https://issues.apache.org/jira/browse/SPARK-33781
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.1.0
            Reporter: Min Shen


MapOutputTrackerWorker caches in memory the MapStatus or MergeStatus array 
that it retrieves from the driver for a given shuffle, so that all tasks 
fetching data for that shuffle can reuse the cached metadata.

However, unlike the MapStatus array, where each task needs to access every 
entry, each task needs only one or a few MergeStatus objects from the 
MergeStatus array, depending on which shuffle partitions the task is 
processing.

For large shuffles with tens or hundreds of thousands of shuffle partitions, 
caching the entire deserialized and decompressed MergeStatus array on the 
executor side is a huge waste of memory when perhaps only 0.1% of its entries 
are going to be used by the tasks running on that executor.

We could improve this by caching the serialized and compressed bytes of the 
MergeStatus array instead, and caching only the needed deserialized 
MergeStatus objects on the executor side. In addition to saving memory, this 
also helps reduce GC pressure on the executor side.
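
The idea can be sketched roughly as follows. This is only an illustration in 
Python, not Spark code: the MergeStatus fields, the MergeStatusCache class, 
the encode_statuses helper, and the JSON plus zlib encoding are all 
hypothetical stand-ins for whatever the executor actually receives from the 
driver. It assumes the whole array is decoded transiently on a cache miss and 
that only the requested entry is retained afterwards.

```python
import json
import zlib
from collections import namedtuple

# Hypothetical stand-in for a MergeStatus entry; the fields are illustrative only.
MergeStatus = namedtuple("MergeStatus", ["partition_id", "location", "total_size"])


def encode_statuses(statuses):
    """Serialize and compress a full MergeStatus array (assumed wire format)."""
    rows = [list(s) for s in statuses]
    return zlib.compress(json.dumps(rows).encode("utf-8"))


class MergeStatusCache:
    """Keeps compressed bytes per shuffle plus only the entries tasks asked for."""

    def __init__(self):
        self._blobs = {}    # shuffle_id -> serialized, compressed array
        self._decoded = {}  # (shuffle_id, partition_id) -> MergeStatus

    def put_serialized(self, shuffle_id, blob):
        self._blobs[shuffle_id] = blob

    def get(self, shuffle_id, partition_id):
        key = (shuffle_id, partition_id)
        status = self._decoded.get(key)
        if status is None:
            # Decode the whole array transiently; only the requested entry is
            # retained, so the rest becomes garbage right after this call.
            rows = json.loads(zlib.decompress(self._blobs[shuffle_id]))
            status = MergeStatus(*rows[partition_id])
            self._decoded[key] = status
        return status

    def decoded_count(self):
        return len(self._decoded)


# Usage: 1000 partitions are cached as bytes, but only one entry is decoded and kept.
statuses = [MergeStatus(i, "host-%d" % (i % 4), 1024) for i in range(1000)]
cache = MergeStatusCache()
cache.put_serialized(0, encode_statuses(statuses))
first = cache.get(0, 42)
```

The trade-off in this sketch is extra CPU on a miss (one decompression per 
newly requested partition) in exchange for a much smaller resident footprint.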



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
