Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
> This is quite drastic for a fetch failure: spark already has mechanisms
in place to detect executor/host failure - which take care of these failure
modes.
Unfortunately, the mechanisms already in place are not sufficient. Consider
a situation where the external shuffle service becomes unresponsive or OOMs: in
that case we will not see any host failure, yet the driver will still receive
fetch failures. The current model assumes that only the shuffle output of a
single executor is lost; however, since the shuffle service serves all the
shuffle files on that host, we should mark all the shuffle files on that host
as unavailable.
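To make the distinction concrete, here is a minimal, purely illustrative sketch (the class and method names are hypothetical, not Spark's actual MapOutputTracker API) of a registry that records which host serves each map output. With an external shuffle service, a fetch failure against a host invalidates every map output served from that host, not just one executor's outputs:

```python
# Hypothetical sketch: a per-host view of shuffle output locations.
# Names (MapOutputTracker, on_fetch_failure, etc.) are illustrative only.

class MapOutputTracker:
    def __init__(self):
        # (shuffle_id, map_id) -> host whose shuffle service serves the output
        self.outputs = {}

    def register(self, shuffle_id, map_id, host):
        self.outputs[(shuffle_id, map_id)] = host

    def on_fetch_failure(self, host):
        # Mark ALL shuffle outputs on the failed host as lost, since the
        # external shuffle service serves every shuffle file on that host,
        # not just the files of the executor that reported the failure.
        lost = [key for key, h in self.outputs.items() if h == host]
        for key in lost:
            del self.outputs[key]
        return lost


tracker = MapOutputTracker()
tracker.register(0, 0, "hostA")
tracker.register(0, 1, "hostA")
tracker.register(0, 2, "hostB")

# A single fetch failure against hostA invalidates both outputs it serves;
# hostB's output remains registered.
lost = tracker.on_fetch_failure("hostA")
print(sorted(lost))
```

The point of the sketch is only the granularity of invalidation: keying loss by host rather than by executor matches the failure mode described above, where the shuffle service (not an individual executor) is the unresponsive component.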