[ https://issues.apache.org/jira/browse/SPARK-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958329#comment-13958329 ]
Kan Zhang commented on SPARK-1118:
----------------------------------
I took a look at running SparkPi on my single-node cluster (laptop). There
seem to be two issues.
1. All the work was done by the first executor. When the job finished, the
driver asked the executor to shut down. However, this clean exit was recorded
as the FAILED executor state by the Worker. I introduced an EXITED executor
state for executors that exit voluntarily, covering both normal and abnormal
exits as distinguished by the exit code (see the first sketch below).
2. When the Master was notified that the first executor exited, it launched a
second one, which is not needed and subsequently got killed when the App
disassociated. We could change the scheduler to tell the Master the job is
done so that the Master wouldn't start a second executor. However, there is a
race condition between the App telling the Master the job is done and the
Worker telling the Master the first executor exited; there is no guarantee
the former happens before the latter. Instead, I chose to check the exit code
when an executor exits. If the exit code is 0, I assume the driver asked the
executor to shut down, and the Master will not schedule new executors (see
the second sketch below). This avoids launching the second executor, and
consequently no executor shows up as killed in the Worker's log. However, it
is still possible (although it didn't happen on my local cluster) for the
Master to kill the first executor, if the Master sees the App disassociation
event before the first executor's exit. The order of these events can't be
guaranteed since they arrive via different paths. If an executor does get
killed, I favor leaving its state as KILLED, even though the App state may be
FINISHED.
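To make issue 1 concrete, here is a minimal sketch of the Worker-side idea
(simplified names, not the actual patch; ExecutorStateChanged, sendToMaster
and the WorkerSketch wrapper are stand-ins I'm using for illustration):
{code}
// Sketch of issue 1: add an EXITED state for voluntary exits, and have the
// Worker report it instead of unconditionally reporting FAILED.
object ExecutorState extends Enumeration {
  type ExecutorState = Value

  // EXITED is the new state for executors that terminate on their own,
  // whether the exit was clean or abnormal.
  val LAUNCHING, LOADING, RUNNING, KILLED, FAILED, LOST, EXITED = Value

  def isFinished(state: ExecutorState): Boolean =
    Seq(KILLED, FAILED, LOST, EXITED).contains(state)
}

object WorkerSketch {
  // Simplified stand-in for the real Worker -> Master message.
  case class ExecutorStateChanged(
      appId: String,
      execId: Int,
      state: ExecutorState.ExecutorState,
      message: Option[String],
      exitCode: Option[Int])

  def sendToMaster(msg: ExecutorStateChanged): Unit = ()  // messaging stub

  // When the executor process exits on its own, report EXITED; the exit
  // code still tells the Master whether the exit was clean.
  def onExecutorProcessExit(appId: String, execId: Int, exitCode: Int): Unit = {
    val message = s"Command exited with code $exitCode"
    sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.EXITED,
      Some(message), Some(exitCode)))
  }
}
{code}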
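And a sketch of the Master-side check from issue 2, reusing the types above
and under the same caveats (removeExecutor and scheduleReplacementExecutor
are hypothetical stubs for the Master's bookkeeping and scheduling):
{code}
// Sketch of issue 2: only relaunch an executor when the exit looks
// involuntary, i.e. the exit code is non-zero.
object MasterSketch {
  import WorkerSketch.ExecutorStateChanged

  def removeExecutor(appId: String, execId: Int): Unit = ()  // stub
  def scheduleReplacementExecutor(appId: String): Unit = ()  // stub

  def onExecutorStateChanged(change: ExecutorStateChanged): Unit = {
    if (ExecutorState.isFinished(change.state)) {
      removeExecutor(change.appId, change.execId)
      // Exit code 0 is taken to mean the driver asked the executor to shut
      // down, i.e. the app is finishing, so no replacement is scheduled.
      // This sidesteps the race between the App's "job done" message and
      // the Worker's "executor exited" message reaching the Master.
      val cleanExit = change.exitCode.contains(0)
      if (!cleanExit) {
        scheduleReplacementExecutor(change.appId)
      }
    }
  }
}
{code}
The only point is the cleanExit check: exit code 0 suppresses the replacement
executor, while any non-zero code keeps the existing fault-tolerance behavior
of relaunching.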
Here's the PR. Please let me know what else I can do.
https://github.com/apache/spark/pull/306
> Executor state shows as KILLED even when the application finishes normally
> --------------------------------------------------------------------------
>
> Key: SPARK-1118
> URL: https://issues.apache.org/jira/browse/SPARK-1118
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Nan Zhu
> Fix For: 1.0.0
>
>
> This seems weird: ExecutorState has no FINISHED option; a terminated
> executor can only be KILLED, FAILED, or LOST.