[
https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-644.
---------------------------------
Resolution: Fixed
> Jobs canceled due to repeated executor failures may hang
> --------------------------------------------------------
>
> Key: SPARK-644
> URL: https://issues.apache.org/jira/browse/SPARK-644
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 0.6.1
> Reporter: Josh Rosen
> Assignee: Josh Rosen
>
> In order to prevent an infinite loop, the standalone master aborts jobs that
> experience more than 10 executor failures (see
> https://github.com/mesos/spark/pull/210). Currently, the master crashes when
> aborting jobs (this is the issue that uncovered SPARK-643). If we fix the
> crash, which involves removing a {{throw}} from the actor's {{receive}}
> method, then these failures can lead to a hang because they cause the job to
> be removed from the master's scheduler, but the upstream scheduler components
> aren't notified of the failure and will wait for the job to finish.
>
> I've considered fixing this by adding additional callbacks to propagate the
> failure to the higher-level schedulers. It might be cleaner to move the
> decision to abort the job into the higher-level layers of the scheduler,
> sending an {{AbortJob(jobId)}} message to the Master. The Client is already
> notified of executor state changes, so it may be able to make the decision to
> abort (or defer that decision to a higher layer).
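
The second option described above (letting the layer that already sees executor state changes decide to abort, then telling the Master) could look roughly like the sketch below. It is a minimal illustration and not Spark's scheduler code: `Master`, `Client`, `executor_failed`, and `on_job_failed` are hypothetical stand-ins, the only names taken from the issue are `AbortJob(jobId)` and the limit of 10 failures, and plain method calls replace actor messaging.

```python
from dataclasses import dataclass
from typing import Callable, Set

# Limit taken from the issue text ("more than 10 executor failures").
MAX_EXECUTOR_FAILURES = 10


@dataclass(frozen=True)
class AbortJob:
    """Message named in the issue; the field name is an assumption."""
    job_id: str


class Master:
    """Hypothetical stand-in for the standalone master.

    It no longer decides to abort; it only removes a job when told to.
    """

    def __init__(self) -> None:
        self._jobs: Set[str] = set()

    def register(self, job_id: str) -> None:
        self._jobs.add(job_id)

    def is_registered(self, job_id: str) -> bool:
        return job_id in self._jobs

    def receive(self, message: AbortJob) -> None:
        # Removing the job here must not raise, or the master would crash.
        self._jobs.discard(message.job_id)


class Client:
    """Hypothetical stand-in for the layer notified of executor state changes."""

    def __init__(self, master: Master, job_id: str,
                 on_job_failed: Callable[[str], None]) -> None:
        self._master = master
        self._job_id = job_id
        self._on_job_failed = on_job_failed  # callback into the upstream scheduler
        self._failures = 0
        self._aborted = False

    def executor_failed(self) -> None:
        if self._aborted:
            return  # the abort has already been reported once
        self._failures += 1
        if self._failures > MAX_EXECUTOR_FAILURES:
            self._aborted = True
            # Tell the master to drop the job, then notify the upstream
            # scheduler so that it does not wait for the job to finish.
            self._master.receive(AbortJob(self._job_id))
            self._on_job_failed(
                f"job {self._job_id} aborted after "
                f"{self._failures} executor failures"
            )
```

Because the same object both sends `AbortJob` and invokes the failure callback, the upstream scheduler cannot be left waiting on a job that the master has already removed, which is the hang described in the issue.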