[
https://issues.apache.org/jira/browse/SPARK-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-11306:
------------------------------
Component/s: Scheduler
> Executor JVM loss can lead to a hang in Standalone mode
> -------------------------------------------------------
>
> Key: SPARK-11306
> URL: https://issues.apache.org/jira/browse/SPARK-11306
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Reporter: Kay Ousterhout
> Assignee: Kay Ousterhout
>
> This commit:
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0
> introduced a bug where, in Standalone mode, if a task fails and crashes the
> JVM, the failure is treated as a "normal failure" (i.e., one unrelated to
> the task itself), so it isn't counted against the task's maximum number of
> failures:
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
> As a result, if a task fails in a way that results in it crashing the JVM,
> it will continuously be re-launched, resulting in a hang.
> Unfortunately, this issue is difficult to reproduce because of a race
> between the multiple code paths that handle executor losses: in the setup
> I'm using, Akka's notification that the executor was lost always reaches the
> TaskSchedulerImpl first, so the task eventually gets killed (see my recent
> email to the dev list).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)