Aaron Davidson created SPARK-4238:
-------------------------------------
Summary: Perform network-level retry of shuffle file fetches
Key: SPARK-4238
URL: https://issues.apache.org/jira/browse/SPARK-4238
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Critical
During periods of high network (or GC) load, it is not uncommon for shuffle
file fetches to fail with IOExceptions caused by connection failures.
Unfortunately, such a failure is currently treated as an inability to fetch the
files at all, so we mark the executor as lost and recompute all of its shuffle
outputs.
We should retry the fetch at the network level when an IOException occurs, so
that a connection failure alone does not cause the executor to be marked as
lost.
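
To make the proposal concrete, here is a rough sketch of what a network-level
retry could look like. It is not Spark's actual code: the class RetryingFetch,
the method fetchWithRetry, and the maxRetries and waitMillis parameters are
hypothetical names chosen for illustration. Only an IOException triggers a
retry; any other failure propagates immediately.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Spark's API.
final class RetryingFetch {
    private RetryingFetch() {}

    /**
     * Runs the fetch and retries it up to maxRetries times when it fails
     * with an IOException, sleeping waitMillis between attempts.
     * The last IOException is rethrown once all attempts are used up.
     */
    static <T> T fetchWithRetry(Callable<T> fetch, int maxRetries, long waitMillis)
            throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return fetch.call();
            } catch (IOException e) {
                // Possibly a transient connection failure: remember it and retry.
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        // Every attempt failed; only now should the caller treat the fetch as failed.
        throw last;
    }
}
```

With something along these lines, the caller would report a fetch failure (and
mark the executor as lost) only after the retries are exhausted, instead of on
the first IOException.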