Enrico Minack created SPARK-54729:
-------------------------------------
Summary: Proactively replicate shuffle data to FallbackStorage
Key: SPARK-54729
URL: https://issues.apache.org/jira/browse/SPARK-54729
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 4.2.0
Reporter: Enrico Minack
In a Kubernetes environment, the {{FallbackStorage}} can be used to migrate an executor's shuffle data when the executor is gracefully decommissioned. This enables dynamic allocation on Kubernetes.
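For reference, shuffle data migration to the {{FallbackStorage}} during decommissioning is enabled with roughly the following configuration today (the bucket path is a placeholder, and exact keys may vary by Spark version):
{code}
spark.decommission.enabled=true
spark.storage.decommission.enabled=true
spark.storage.decommission.shuffleBlocks.enabled=true
spark.storage.decommission.fallbackStorage.path=s3a://my-bucket/spark-fallback/
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
{code}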
Let's add a mode where the shuffle data of a task are replicated to the
{{FallbackStorage}} as soon as the task finishes. The shuffle data are still
served by the executor; the {{FallbackStorage}} merely holds a proactively
copied replica of the data.
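One possible way to expose this feature could be a pair of configuration keys; the names below are purely illustrative and not part of any existing Spark API:
{code}
# hypothetical configuration for the proposed feature
spark.shuffle.proactiveReplication.enabled=true
# "async" (best-effort) or "sync" (reliable), see the two modes below
spark.shuffle.proactiveReplication.mode=async
{code}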
This proactive replication brings the following advantages:
# *Faster decommissioning:* The decommissioning phase is sped up since all
shuffle data already exist on the {{FallbackStorage}}. Decommissioning
simplifies to merely updating the location of the shuffle data to point to the
{{FallbackStorage}}.
# *Node failure resiliency:* Shuffle data of executors that did not go
through the decommissioning phase can be recovered by simply reading them from
the {{FallbackStorage}}.
There are two modes:
# *Async copy (best-effort mode):* Shuffle data are asynchronously copied
*after* a task finishes. No delay is added since the data are copied in the
background. There is a high chance that the replica exists when needed, but no guarantee.
# *Sync copy (reliable mode):* Shuffle data are copied *at the end* of the
task. This delays task completion by the time needed to copy the shuffle
data. A successful task guarantees that the shuffle data replica exists. Both modes are sketched below.
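To make the difference between the two modes concrete, here is a minimal Scala sketch. It assumes the shuffle files of the finished task and the fallback directory are already known; all names ({{ProactiveShuffleReplication}}, {{replicateAsync}}, {{replicateSync}}) are illustrative and not part of Spark's existing {{FallbackStorage}} API:
{code:scala}
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Minimal sketch of the two proposed copy modes; not Spark's actual internals.
object ProactiveShuffleReplication {

  // Background pool used by the best-effort (async) mode.
  private val replicationPool = ExecutionContext.fromExecutorService(
    Executors.newFixedThreadPool(2))

  // Copies a task's shuffle files (data + index) to the fallback location.
  private def copyShuffleFiles(shuffleFiles: Seq[Path], fallbackDir: Path): Unit = {
    Files.createDirectories(fallbackDir)
    shuffleFiles.foreach { file =>
      Files.copy(file, fallbackDir.resolve(file.getFileName.toString),
        StandardCopyOption.REPLACE_EXISTING)
    }
  }

  // Async (best-effort) mode: the task finishes immediately, the replica is
  // written in the background and may not exist yet when it is needed.
  def replicateAsync(shuffleFiles: Seq[Path], fallbackDir: Path): Future[Unit] =
    Future(copyShuffleFiles(shuffleFiles, fallbackDir))(replicationPool)

  // Sync (reliable) mode: runs at the end of the task, so a successful task
  // guarantees the replica exists, at the cost of a longer task runtime.
  def replicateSync(shuffleFiles: Seq[Path], fallbackDir: Path): Unit =
    copyShuffleFiles(shuffleFiles, fallbackDir)
}
{code}
In the async mode the driver would register the replica location only once the background copy succeeds, whereas in the sync mode the replica can be assumed to exist as soon as the task reports success.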