Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/11254#issuecomment-185911203
  
    I don't understand your reasoning.  If a machine that was holding a needed 
shuffle file is rebooting, then you should want the Stage to fail so that the 
results can be recomputed -- if the Worker is gone, then so too are the shuffle 
files, whether they were being served directly or by the shuffle service 
co-located on the node.  Why would you want to delay that recomputation?
    
    To improve resiliency in the face of worker node failures/reboots, it would 
be nice to have shuffle files replicated and fetchable from other worker 
nodes, but that is quite a different proposition from this PR.
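    
    To make the recomputation path concrete, here is a minimal, self-contained 
Scala sketch of the argument above. It is not Spark's actual DAGScheduler or 
MapOutputTracker code; the names ShuffleBlock, StageState, FetchFailureSketch, 
and handleHostLost are hypothetical. The point it illustrates: once the host 
serving a map output is gone, the only useful reaction is to mark that output 
missing and resubmit the stage so it gets recomputed elsewhere.
    
    ```scala
    // Hypothetical, simplified model of the behavior described above -- not
    // Spark's DAGScheduler. A lost host invalidates every map output it served,
    // and the stage must be resubmitted to recompute exactly those outputs.
    final case class ShuffleBlock(mapId: Int, host: String)
    
    sealed trait StageState
    case object Running extends StageState
    final case class Resubmit(missingMapIds: Seq[Int]) extends StageState
    
    object FetchFailureSketch {
      def handleHostLost(lostHost: String, outputs: Seq[ShuffleBlock]): StageState = {
        val missing = outputs.filter(_.host == lostHost).map(_.mapId)
        if (missing.isEmpty) Running       // nothing lived on that host; keep going
        else Resubmit(missing)             // recompute the lost map outputs
      }
    
      def main(args: Array[String]): Unit = {
        val outputs = Seq(ShuffleBlock(0, "node-a"), ShuffleBlock(1, "node-b"))
        // Losing node-b loses map output 1, so the stage is resubmitted for it.
        println(handleHostLost("node-b", outputs)) // Resubmit(List(1))
      }
    }
    ```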

