Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
> fetch failure does not imply lost executor - it could be a transient issue. Similarly, executor loss does not imply host loss.
You are right, it could be transient, but the shuffle client already retries, which
handles transient failures. Once the driver receives a fetch failure, we always
assume the output is lost. The current model marks the output lost only for the
particular executor, which makes sense only when the external shuffle service is
disabled and the executors are serving the shuffle files themselves. When the
external shuffle service is enabled, however, shuffle files are served per host
rather than per executor, so a fetch failure means all output on that host should
be marked unavailable.
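To make the distinction concrete, here is a minimal sketch (not Spark's actual `MapOutputTracker` API; the case class and function names are hypothetical) of how the set of valid map outputs could be pruned after a fetch failure, per executor versus per host depending on whether the external shuffle service is enabled:

```scala
// Hypothetical model of a registered map output: which executor wrote it
// and which host it lives on.
case class MapOutput(mapId: Int, execId: String, host: String)

// Returns the outputs that remain valid after a fetch failure from
// (failedExecId, failedHost). With the external shuffle service enabled,
// shuffle files are served by one service per host, so a fetch failure
// invalidates every output on that host; otherwise each executor serves
// its own files, so only that executor's outputs are invalidated.
def outputsAfterFetchFailure(
    outputs: Seq[MapOutput],
    failedExecId: String,
    failedHost: String,
    externalShuffleServiceEnabled: Boolean): Seq[MapOutput] = {
  if (externalShuffleServiceEnabled) {
    outputs.filterNot(_.host == failedHost)
  } else {
    outputs.filterNot(_.execId == failedExecId)
  }
}
```

With two executors on host `h1` and one on `h2`, a failure fetching from `e1` invalidates only `e1`'s output when executors serve their own files, but invalidates both `h1` outputs when the host-level shuffle service is serving them.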