GitHub user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17088
  
    > fetch failure does not imply lost executor - it could be a transient issue.
    > Similarly, executor loss does not imply host loss.
    
    You are right that it could be transient, but we do have retries on the shuffle 
    client to handle transient failures. When the driver receives a fetch failure, it 
    always assumes the output is lost. The current model assumes the output is lost 
    only for a particular executor, which makes sense only if the external shuffle 
    service is disabled and the executors are serving their shuffle files themselves. 
    However, when the external shuffle service is enabled, a fetch failure means all 
    output on that host should be marked unavailable.
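    The distinction above can be sketched as a small decision function. This is an
    illustrative model, not actual Spark code; the names `external_shuffle_enabled`,
    `executors_on_host`, and `executors_to_invalidate` are hypothetical.

    ```python
    def executors_to_invalidate(failed_executor, failed_host,
                                external_shuffle_enabled, executors_on_host):
        """Return the executor IDs whose shuffle output should be marked lost
        after a fetch failure from `failed_executor` on `failed_host`.

        executors_on_host: dict mapping host -> list of executor IDs on it.
        """
        if external_shuffle_enabled:
            # A single shuffle service serves the files of every executor on
            # the host, so a fetch failure implies all output written on that
            # host is unreachable.
            return set(executors_on_host[failed_host])
        # Without the external shuffle service, each executor serves only its
        # own files, so only that executor's output is lost.
        return {failed_executor}
    ```

    For example, with two executors on the failed host, enabling the external
    shuffle service widens the invalidation from one executor to both.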
    
    


