EnricoMi opened a new pull request, #53502: URL: https://github.com/apache/spark/pull/53502
### What changes were proposed in this pull request?

This adds a mode where the shuffle data of a task can be replicated to the `FallbackStorage` as soon as the task finishes. The shuffle data are still served by the executor, while the `FallbackStorage` simply holds a proactively copied replica of the data. This brings the following advantages:

1. **Faster decommissioning:** The decommissioning phase is sped up because all data already exist on the `FallbackStorage`. There is no need to copy any data at that time; the phase simplifies to merely updating the location of the shuffle data to the `FallbackStorage`.
2. **Node failure resiliency:** Shuffle data of executors that did not go through the decommissioning phase can be recovered by simply reading from the `FallbackStorage`.

There are two modes:

1. **Async copy (best-effort mode):** Shuffle data are copied asynchronously **after** a task finishes. No delay is added, as data are copied in the background. The replica exists with high probability, but there is no guarantee.
2. **Sync copy (reliable mode):** Shuffle data are copied **at the end** of the task. This delays task completion by the time needed to copy the shuffle data. A successful task guarantees that the shuffle data replica exists.

### Why are the changes needed?

In a Kubernetes environment, an executor can use the `FallbackStorage` to migrate its shuffle data when it is gracefully decommissioned. This allows for dynamic allocation in Kubernetes. However, the decommissioning phase is constrained in the time it is allowed to consume, and there are situations where no decommissioning phase happens at all, e.g. node or pod failures.

### Does this PR introduce _any_ user-facing change?

The user can enable this feature through the following configuration options, which default to the current behaviour:

- `spark.storage.decommission.fallbackStorage.proactive.enabled`
- `spark.storage.decommission.fallbackStorage.proactive.reliable`

### How was this patch tested?
Unit tests, and tested in a production environment with extensive dynamic allocation down-scaling.

### Was this patch authored or co-authored using generative AI tooling?

No
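To illustrate the difference between the two modes described above, here is a minimal, hypothetical sketch (not the PR's actual code; all names are made up for illustration). In reliable mode the copy completes before the task reports success; in best-effort mode the copy runs in the background:

```python
import threading

def copy_to_fallback_storage(shuffle_data, replica_store):
    # Stand-in for uploading the task's shuffle files to the
    # FallbackStorage path (hypothetical; real uploads go to e.g. S3/HDFS).
    replica_store.append(shuffle_data)

def finish_task(shuffle_data, replica_store, reliable):
    if reliable:
        # Sync (reliable) mode: the copy happens before the task finishes,
        # so a successful task guarantees the replica exists.
        copy_to_fallback_storage(shuffle_data, replica_store)
        return None
    else:
        # Async (best-effort) mode: the task finishes immediately while the
        # copy proceeds in the background; the replica likely, but not
        # certainly, exists by the time it is needed.
        t = threading.Thread(target=copy_to_fallback_storage,
                             args=(shuffle_data, replica_store))
        t.start()
        return t
```

The trade-off is the usual one: reliable mode adds the copy latency to every task, while best-effort mode adds none but cannot guarantee the replica on sudden node or pod failure.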
