EnricoMi opened a new pull request, #53502: URL: https://github.com/apache/spark/pull/53502
### What changes were proposed in this pull request?

This adds a mode where the shuffle data of a task can be replicated to the `FallbackStorage` as soon as the task finishes. The shuffle data are still served by the executor, while the `FallbackStorage` simply holds a proactively copied replica of the data. This brings the following advantages:

1. **Faster decommissioning:** The decommissioning phase is sped up because all data already exist on the `FallbackStorage`. There is no need to copy any data at that time; the phase simplifies to merely updating the location of the shuffle data to the `FallbackStorage`.
2. **Node failure resiliency:** Shuffle data of executors that did not go through the decommissioning phase can be recovered by simply reading from the `FallbackStorage`.

There are two modes:

1. **Async copy (best-effort mode):** Shuffle data are copied asynchronously **after** a task finishes. No delay is added, as data are copied in the background. The replica exists with high probability, but there is no guarantee.
2. **Sync copy (reliable mode):** Shuffle data are copied **at the end** of the task. This delays task completion by the time needed to copy the shuffle data. A successful task guarantees that the shuffle data replica exists.

### Why are the changes needed?

In a Kubernetes environment, an executor can use the `FallbackStorage` to migrate its shuffle data when it is gracefully decommissioned. This allows for dynamic allocation in Kubernetes. However, the decommissioning phase is constrained in the time it is allowed to consume, and there are situations where no decommissioning phase happens at all, e.g. node or pod failures.

### Does this PR introduce _any_ user-facing change?

The user can enable this feature through the following configuration options, which default to the current behaviour:

- `spark.storage.decommission.fallbackStorage.proactive.enabled`
- `spark.storage.decommission.fallbackStorage.proactive.reliable`

### How was this patch tested?
Unit tests, and tested in a production environment with extensive dynamic allocation down-scaling.

### Was this patch authored or co-authored using generative AI tooling?

No
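To illustrate the difference between the two modes described above, here is a minimal, hypothetical sketch (not the PR's actual code; all names are made up for illustration). In reliable mode the copy completes before the task reports success; in best-effort mode the copy runs in the background:

```python
import threading

def copy_to_fallback_storage(shuffle_data, replica_store):
    # Stand-in for uploading the task's shuffle files to the
    # FallbackStorage path (hypothetical; real uploads go to e.g. S3/HDFS).
    replica_store.append(shuffle_data)

def finish_task(shuffle_data, replica_store, reliable):
    if reliable:
        # Sync (reliable) mode: the copy happens before the task finishes,
        # so a successful task guarantees the replica exists.
        copy_to_fallback_storage(shuffle_data, replica_store)
        return None
    else:
        # Async (best-effort) mode: the task finishes immediately while the
        # copy proceeds in the background; the replica likely, but not
        # certainly, exists by the time it is needed.
        t = threading.Thread(target=copy_to_fallback_storage,
                             args=(shuffle_data, replica_store))
        t.start()
        return t
```

The trade-off is the usual one: reliable mode adds the copy latency to every task, while best-effort mode adds none but cannot guarantee the replica on sudden node or pod failure.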
