holdenk opened a new pull request #29511: URL: https://github.com/apache/spark/pull/29511
### What changes were proposed in this pull request?

Add the option to launch the external shuffle service as a sidecar on Kubernetes.

### Why are the changes needed?

Without an external shuffle service, users with flaky jobs or executors that hit OOMs may lose the data that was local to the failed executor. Note: this change does *not* help with dynamic scheduling or cluster over-commit situations.

### Does this PR introduce _any_ user-facing change?

Yes: it provides an option to enable a sidecar shuffle service.

### How was this patch tested?

Manual: verified that blocks were served from the external shuffle service sidecar.

Automatic: verify that the external shuffle service launches and registers the associated executor, and vice versa.

### What's work in progress?

Code style, and we might want to add some more configuration options. This PR also picks up a change so that the pod restarts on failure; there is a separate PR to make that configurable, and it would be better to get that one merged first. The restartOnFailure change is used so that if the executor somehow starts before the shuffle service, the executor is simply restarted. In Kubernetes 1.19 we might be able to mark the shuffle service as a sidecar explicitly so that it starts first, but that isn't going to be widely available soon.
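To illustrate the restart-on-failure reasoning above, here is a minimal sketch in plain Python. It is a toy model only and does not reflect the code in this PR or any Kubernetes or Spark API. The names `ShuffleServiceSidecar` and `run_executor_with_restarts` are hypothetical, and the "tick" counter is an assumed stand-in for time passing between executor start attempts.

```python
from dataclasses import dataclass


@dataclass
class ShuffleServiceSidecar:
    """Hypothetical stand-in for the sidecar; it becomes ready at a given tick."""
    ready_at_tick: int

    def is_ready(self, tick: int) -> bool:
        return tick >= self.ready_at_tick


def run_executor_with_restarts(sidecar: ShuffleServiceSidecar,
                               max_restarts: int = 10) -> int:
    """Return how many restarts were needed before the executor registered.

    Each loop iteration models one executor start attempt (one tick).
    If the sidecar is not ready, the executor exits with a failure and the
    assumed restart-on-failure policy simply starts it again.
    """
    for attempt in range(max_restarts + 1):
        if sidecar.is_ready(tick=attempt):
            return attempt  # registration succeeded on this attempt
        # Executor could not register and exits; the restart policy retries.
    raise RuntimeError("shuffle service never became ready")
```

In this model, an executor that starts after the sidecar needs no restarts, while one that starts early is restarted until the sidecar is up. Explicit sidecar ordering, as mentioned for Kubernetes 1.19, would make the retry loop unnecessary.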
