[ 
https://issues.apache.org/jira/browse/SPARK-54729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-54729:
----------------------------------
    Description: 
In a Kubernetes environment, the {{FallbackStorage}} can be used when an executor 
is gracefully decommissioned to migrate its shuffle data. This enables 
dynamic allocation in Kubernetes.

Let's add a mode where the shuffle data of a task can be replicated to the 
{{FallbackStorage}} as soon as the task finishes. The shuffle data are still 
served by the executor, while the {{FallbackStorage}} simply holds a 
proactively copied replica of the data.

This brings the following advantages:
 # *Decommissioning speed-up:* The decommissioning phase is sped up since all 
data already exist on the {{FallbackStorage}}. Decommissioning simplifies to 
merely updating the location of the shuffle data to the 
{{FallbackStorage}}.
 # *Node failure resiliency:* Shuffle data of executors that did not go 
through the decommissioning phase can be recovered by simply reading them from 
the {{FallbackStorage}}.
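For context, the {{FallbackStorage}} is today wired up through the storage-decommission settings shown below. The last property is purely hypothetical, sketching how the proposed proactive replication could be toggled; it does not exist in any Spark release:

```properties
# Existing Spark settings (3.1+) enabling shuffle migration to fallback storage
spark.storage.decommission.enabled                 true
spark.storage.decommission.shuffleBlocks.enabled   true
spark.storage.decommission.fallbackStorage.path    s3a://my-bucket/spark-fallback/

# Hypothetical flag for the proactive replication proposed here (illustration only)
spark.storage.decommission.fallbackStorage.proactiveReplication  async
```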

There are two modes:
 # *Async copy (best-effort mode):* Shuffle data are asynchronously copied 
*after* a task finishes. No delay is added, as the data are copied in the 
background. The replica is very likely to exist, but there is no guarantee.
 # *Sync copy (reliable mode):* Shuffle data are copied *at the end* of the 
task. This delays task completion by the time needed to copy the shuffle 
data. A successful task guarantees that the shuffle data replica exists.
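The two modes above could be sketched as follows. This is an illustrative, self-contained Scala sketch, assuming plain file copies stand in for shuffle files and the fallback storage; names like {{replicateAsync}} are hypothetical and not actual Spark APIs:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical sketch of the two replication modes. Real shuffle files,
// block resolvers, and the FallbackStorage API are replaced by file copies.
object ShuffleReplicationSketch {
  private val pool = Executors.newFixedThreadPool(2)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

  // Copy a finished task's shuffle file to the fallback location.
  private def copyToFallback(local: Path, fallback: Path): Unit = {
    Files.copy(local, fallback, StandardCopyOption.REPLACE_EXISTING)
    ()
  }

  // Async copy (best-effort): the task finishes immediately,
  // the replica is written in the background and may not exist yet.
  def replicateAsync(local: Path, fallback: Path): Future[Unit] =
    Future(copyToFallback(local, fallback))

  // Sync copy (reliable): the task only finishes once the replica exists.
  def replicateSync(local: Path, fallback: Path): Unit =
    copyToFallback(local, fallback)

  def shutdown(): Unit = pool.shutdown()
}
```

In the async mode a reducer that misses the replica would fall back to a normal fetch failure, which is why it is best-effort only.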

  was:
In a Kubernetes environment, the {{FallbackStorage}} can be used when an executor 
is gracefully decommissioned to migrate its shuffle data. This enables 
dynamic allocation in Kubernetes.

Let's add a mode where the shuffle data of a task can be replicated to the 
{{FallbackStorage}} as soon as the task finishes. The shuffle data are still 
served by the executor, while the {{FallbackStorage}} simply holds a 
proactively copied replica of the data.

This brings the following advantages:
 # *Decommissioning speed-up:* The decommissioning phase is sped up 
since all data already exist on the {{FallbackStorage}}. Decommissioning 
simplifies to merely updating the location of the shuffle data to the 
{{FallbackStorage}}.
 # *Node failure resiliency:* Shuffle data of executors that did not go 
through the decommissioning phase can be recovered by simply reading them from 
the {{FallbackStorage}}.

There are two modes:
 # *Async copy (best-effort mode):* Shuffle data are asynchronously copied 
*after* a task finishes. No delay is added, as the data are copied in the 
background. The replica is very likely to exist, but there is no guarantee.
 # *Sync copy (reliable mode):* Shuffle data are copied *at the end* of the 
task. This delays task completion by the time needed to copy the shuffle 
data. A successful task guarantees that the shuffle data replica exists.


> Proactively replicate shuffle data to FallbackStorage
> -----------------------------------------------------
>
>                 Key: SPARK-54729
>                 URL: https://issues.apache.org/jira/browse/SPARK-54729
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 4.2.0
>            Reporter: Enrico Minack
>            Priority: Major
>              Labels: pull-request-available
>
> In a Kubernetes environment, the {{FallbackStorage}} can be used when an 
> executor is gracefully decommissioned to migrate its shuffle data. This 
> enables dynamic allocation in Kubernetes.
> Let's add a mode where the shuffle data of a task can be replicated to the 
> {{FallbackStorage}} as soon as the task finishes. The shuffle data are still 
> served by the executor, while the {{FallbackStorage}} simply holds a 
> proactively copied replica of the data.
> This brings the following advantages:
>  # *Decommissioning speed-up:* The decommissioning phase is sped up 
> since all data already exist on the {{FallbackStorage}}. Decommissioning 
> simplifies to merely updating the location of the shuffle 
> data to the {{FallbackStorage}}.
>  # *Node failure resiliency:* Shuffle data of executors that did not go 
> through the decommissioning phase can be recovered by simply reading them from 
> the {{FallbackStorage}}.
> There are two modes:
>  # *Async copy (best-effort mode):* Shuffle data are asynchronously copied 
> *after* a task finishes. No delay is added, as the data are copied in the 
> background. The replica is very likely to exist, but there is no guarantee.
>  # *Sync copy (reliable mode):* Shuffle data are copied *at the end* of the 
> task. This delays task completion by the time needed to copy the shuffle 
> data. A successful task guarantees that the shuffle data replica exists.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
