Sital Kedia created SPARK-19753:
-----------------------------------
Summary: Remove all shuffle files on a host in case of slave lost
or fetch failure
Key: SPARK-19753
URL: https://issues.apache.org/jira/browse/SPARK-19753
Project: Spark
Issue Type: Bug
Components: Scheduler
Affects Versions: 2.0.1
Reporter: Sital Kedia
Currently, when we detect a fetch failure, we only remove the shuffle files
produced by the executor whose output could not be fetched, even though the
host itself might be down and all of its shuffle files inaccessible. When
multiple executors run on a host, that host going down currently results in
multiple fetch failures and multiple retries of the stage, which is very
inefficient. If we remove all the shuffle files on that host on the first fetch
failure, we can rerun all the tasks that ran on that host in a single stage
retry.
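
To illustrate the difference, here is a minimal sketch in Python of a toy
model, not Spark's actual scheduler code. The names ShuffleRegistry,
remove_outputs_on_executor, remove_outputs_on_host and stage_retries, as well
as the hosts h1/h2 and executors e1/e2/e3, are all hypothetical and exist only
for this example. It assumes each stage attempt reports a single fetch failure
and that lost map outputs are recomputed on a surviving host.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MapOutput:
    """Location of one map task's shuffle output (hypothetical model)."""
    executor_id: str
    host: str


class ShuffleRegistry:
    """Toy stand-in for the scheduler's record of shuffle output locations."""

    def __init__(self, num_partitions: int) -> None:
        self.outputs: Dict[int, Optional[MapOutput]] = {
            p: None for p in range(num_partitions)
        }

    def register(self, partition: int, executor_id: str, host: str) -> None:
        self.outputs[partition] = MapOutput(executor_id, host)

    def remove_outputs_on_executor(self, executor_id: str) -> None:
        # Current behaviour described in the issue: forget one executor's files.
        for p, out in self.outputs.items():
            if out is not None and out.executor_id == executor_id:
                self.outputs[p] = None

    def remove_outputs_on_host(self, host: str) -> None:
        # Proposed behaviour: forget every file that lived on the host.
        for p, out in self.outputs.items():
            if out is not None and out.host == host:
                self.outputs[p] = None

    def missing_partitions(self) -> List[int]:
        return sorted(p for p, out in self.outputs.items() if out is None)


def stage_retries(remove_by_host: bool) -> int:
    """Count stage retries needed after host h1 (executors e1, e2) goes down."""
    reg = ShuffleRegistry(4)
    reg.register(0, "e1", "h1")
    reg.register(1, "e2", "h1")
    reg.register(2, "e3", "h2")
    reg.register(3, "e1", "h1")
    dead_host = "h1"
    retries = 0
    while True:
        # Assumption: each attempt surfaces one fetch failure, for the
        # lowest-numbered partition whose output is on the dead host.
        failed = next(
            (out for _, out in sorted(reg.outputs.items())
             if out is not None and out.host == dead_host),
            None,
        )
        if failed is None:
            return retries
        if remove_by_host:
            reg.remove_outputs_on_host(failed.host)
        else:
            reg.remove_outputs_on_executor(failed.executor_id)
        # Recompute the lost map outputs on the surviving host.
        for p in reg.missing_partitions():
            reg.register(p, "e3", "h2")
        retries += 1
```

In this toy scenario, stage_retries(False) needs one retry per executor on the
lost host, while stage_retries(True) needs a single retry, which is the saving
the issue is after.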
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)