holdenk opened a new pull request #29511:
URL: https://github.com/apache/spark/pull/29511


   ### What changes were proposed in this pull request?
   
   Add the option to launch external shuffle service as a sidecar on Kubernetes.
   
   ### Why are the changes needed?
   
   Without an external shuffle service, users with flaky jobs or who experience 
OOMs may lose data which was local to their executor.
   
   Note: this change does *not* help with dynamic scheduling or cluster 
over-commit situations.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes: provides an option to enable a sidecar shuffle service.
   
   ### How was this patch tested?
   
   Manual: verified that blocks were served from the external shuffle service 
sidecar.
   Automatic: verified that the external shuffle service launches and registers 
the associated executor, and vice versa.
   
   ### What's still a work in progress?
   
   Code style, and we might want to add some more configuration options. This PR 
also picks up a change so that the pod restarts on failure. There is a separate 
PR to make that configurable, and it would be better to get that one merged first.
   
   The restartOnFailure change is used so that if somehow the executor starts 
before the shuffle service, the executor is simply restarted. In Kubernetes 
1.19 we might be able to mark the shuffle service as a sidecar explicitly so 
that it starts first, but that isn't going to be widely available soon.
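   
   To make the startup race concrete, here is a rough, self-contained simulation 
of the restart-on-failure idea. It is not code from this PR: `ShuffleService`, 
`run_executor_once`, and `run_with_restart_on_failure` are hypothetical names, and 
the "ticks" are just a stand-in for time passing while the sidecar comes up.

```python
from dataclasses import dataclass


@dataclass
class ShuffleService:
    """Hypothetical stand-in for the sidecar: ready after some number of ticks."""
    ready_after_ticks: int

    def is_ready(self, tick: int) -> bool:
        return tick >= self.ready_after_ticks


def run_executor_once(service: ShuffleService, tick: int) -> int:
    """Mimic one executor start.

    Returns 0 if the executor could register with the shuffle service (it would
    then carry on running), or 1 to mimic the process exiting with an error.
    """
    return 0 if service.is_ready(tick) else 1


def run_with_restart_on_failure(service: ShuffleService, max_restarts: int = 10) -> int:
    """Restart the executor after each failed start; return the restarts needed."""
    for restarts in range(max_restarts + 1):
        # Assumption: each (re)start happens one tick after the previous one.
        if run_executor_once(service, tick=restarts) == 0:
            return restarts
    raise RuntimeError("executor never registered with the shuffle service")


if __name__ == "__main__":
    # The sidecar needs 3 ticks to come up, so the executor is restarted 3 times.
    print(run_with_restart_on_failure(ShuffleService(ready_after_ticks=3)))
```

   If the shuffle service is already up when the executor first starts, no 
restart happens. The restart only papers over the ordering problem, which is why 
an explicit "start the sidecar first" mechanism would be preferable.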

