Github user markhamstra commented on the pull request:
https://github.com/apache/spark/pull/11254#issuecomment-185911203
I don't understand your reasoning. If a machine that was holding a needed
shuffle file is rebooting, then you should want the Stage to fail so that the
results can be recomputed -- if the Worker is gone, then so too are the shuffle
files, whether they were being served directly or by the shuffle service
co-located on the node. Why would you want to delay that recomputation?
To improve resiliency in the face of worker node failures/reboots, it would
be nice to have shuffle files replicated and fetchable from other worker
nodes, but that is quite a different proposition from this PR.
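
As a minimal sketch of the behavior being described (not part of this PR): a wide transformation such as `reduceByKey` creates a shuffle boundary, and if the worker node serving those map outputs is lost, the fetch fails and Spark resubmits the map stage to regenerate the missing data. The paths and app name below are placeholders; the config key shown is the standard external shuffle service flag.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleRecomputeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-recompute-sketch")
      // With the external shuffle service co-located on each worker, shuffle
      // files outlive individual executors -- but not a reboot of the node itself.
      .set("spark.shuffle.service.enabled", "true")

    val sc = new SparkContext(conf)

    // reduceByKey introduces a shuffle boundary: map-side outputs are written
    // to local disk on each worker and later fetched by the reduce tasks.
    val counts = sc.textFile("hdfs:///tmp/input")          // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // If a worker holding needed shuffle files reboots, reduce tasks fail with
    // a fetch failure; the scheduler marks those map outputs as lost, fails the
    // running stage attempt, and resubmits the map stage so the missing shuffle
    // data is recomputed rather than waiting for the node to come back.
    counts.saveAsTextFile("hdfs:///tmp/output")            // placeholder path

    sc.stop()
  }
}
```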