Aaron Davidson created SPARK-4238:
-------------------------------------

             Summary: Perform network-level retry of shuffle file fetches
                 Key: SPARK-4238
                 URL: https://issues.apache.org/jira/browse/SPARK-4238
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Aaron Davidson
            Assignee: Aaron Davidson
            Priority: Critical


During periods of high network (or GC) load, it is not uncommon for 
IOExceptions caused by connection failures to crop up when fetching shuffle 
files. Unfortunately, such a failure is currently interpreted as an inability 
to fetch the files at all, which causes us to mark the executor as lost and 
recompute all of its shuffle outputs.

We should retry the fetch at the network level when an IOException occurs, so 
that a transient connection failure does not cause the executor to be marked 
as lost.
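
As a rough illustration of the idea (not Spark's actual code), the sketch 
below retries a fetch a bounded number of times when it fails with an 
IOException, and only propagates the failure once the retries are exhausted. 
The class name RetryingFetch, the method fetchWithRetry, and the parameters 
maxRetries and waitMs are all hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Spark's API.
public class RetryingFetch {

    /**
     * Runs the given fetch, retrying up to maxRetries times if it throws an
     * IOException. Other exceptions are not retried and propagate at once.
     */
    public static <T> T fetchWithRetry(Callable<T> fetch, int maxRetries, long waitMs)
            throws Exception {
        int retriesUsed = 0;
        while (true) {
            try {
                return fetch.call();
            } catch (IOException e) {
                // Give up only after the retry budget is spent, so the caller
                // sees a failure for persistent problems, not transient ones.
                if (retriesUsed >= maxRetries) {
                    throw e;
                }
                retriesUsed++;
                // Pause briefly to let network or GC pressure subside.
                Thread.sleep(waitMs);
            }
        }
    }
}
```

With maxRetries set to 3, a fetch that fails twice with an IOException and 
then succeeds returns normally on the third attempt, and the caller never sees 
the transient failures. A fixed wait is the simplest choice here; a backoff 
schedule would be an equally reasonable alternative.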



