[
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Graves reassigned SPARK-13669:
-------------------------------------
Assignee: Saisai Shao
> Job will always fail in the external shuffle service unavailable situation
> --------------------------------------------------------------------------
>
> Key: SPARK-13669
> URL: https://issues.apache.org/jira/browse/SPARK-13669
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, YARN
> Reporter: Saisai Shao
> Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> We are running into an issue when YARN work-preserving restart is enabled
> together with the external shuffle service.
> In the work-preserving scenario, an NM (NodeManager) failure does not cause
> its executors to exit, so those executors can still accept and run tasks. The
> problem is that when the NM has failed, the external shuffle service is
> inaccessible, so reduce tasks always fail with “Fetch failure”, and the
> failure of the reduce stage causes the parent stage (map stage) to rerun.
> The tricky part is that the Spark scheduler is not aware that the external
> shuffle service is unavailable, so it reschedules the map tasks on the
> executors whose NM has failed. The reduce stage then fails again with
> “Fetch failure”, and after 4 retries the job fails.
> So the main problem is that we should avoid assigning tasks to those bad
> executors (where the shuffle service is unavailable). Spark's current
> blacklist mechanism can blacklist executors/nodes based on failed tasks, but
> it does not handle this specific fetch failure scenario. We therefore propose
> improving the application-level blacklist mechanism to handle fetch failures
> (especially when the external shuffle service is unavailable) by blacklisting
> the executors/nodes where shuffle fetch is unavailable.
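
The proposal in the description can be illustrated with a small standalone sketch. This is not Spark's scheduler or blacklist code: the names `Executor`, `FetchFailureBlacklist`, and `pick_executor` are hypothetical, and the rule that a fetch failure blacklists the whole node when the external shuffle service is in use (and only the executor otherwise) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Executor:
    """Hypothetical stand-in for an executor: an id plus the host it runs on."""
    executor_id: str
    host: str


@dataclass
class FetchFailureBlacklist:
    """Hypothetical tracker for fetch failures; not Spark's own implementation."""
    external_shuffle_service: bool = True
    executors: set = field(default_factory=set)
    nodes: set = field(default_factory=set)

    def on_fetch_failure(self, source: Executor) -> None:
        # Assumption: with the external shuffle service, shuffle output is
        # served per node, so a fetch failure makes the whole host suspect.
        if self.external_shuffle_service:
            self.nodes.add(source.host)
        else:
            self.executors.add(source.executor_id)

    def is_schedulable(self, executor: Executor) -> bool:
        # An executor is usable only if neither it nor its host is blacklisted.
        return (executor.host not in self.nodes
                and executor.executor_id not in self.executors)


def pick_executor(candidates, blacklist):
    """Return the first candidate that is not blacklisted, or None."""
    for executor in candidates:
        if blacklist.is_schedulable(executor):
            return executor
    return None


if __name__ == "__main__":
    cluster = [
        Executor("1", "host-a"),
        Executor("2", "host-a"),
        Executor("3", "host-b"),
    ]
    blacklist = FetchFailureBlacklist(external_shuffle_service=True)
    # A reduce task reports a fetch failure against map output on host-a.
    blacklist.on_fetch_failure(cluster[0])
    # Rerun map tasks should now avoid every executor on host-a.
    print(pick_executor(cluster, blacklist))
```

Without such a rule the rerun map tasks could land on `host-a` again, which is the retry loop the description reports. In this sketch the next candidate chosen is the executor on `host-b`.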
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)