Kay Ousterhout created SPARK-11306:
--------------------------------------
Summary: Executor JVM loss can lead to a hang in Standalone mode
Key: SPARK-11306
URL: https://issues.apache.org/jira/browse/SPARK-11306
Project: Spark
Issue Type: Bug
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
This commit:
https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0
introduced a bug where, in Standalone mode, if a task fails and crashes the
JVM, the failure is considered a "normal failure" (meaning it's considered
unrelated to the task), so the failure isn't counted against the task's maximum
number of failures:
https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
As a result, if a task fails in a way that results in it crashing the JVM, it
will continuously be re-launched, resulting in a hang.
Unfortunately this issue is difficult to reproduce because of a race condition
where we have multiple code paths that are used to handle executor losses, and
in the setup I'm using, Akka's notification that the executor was lost always
gets to the TaskSchedulerImpl first, so the task eventually gets killed (see my
recent email to the dev list).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]