[ 
https://issues.apache.org/jira/browse/SPARK-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958329#comment-13958329
 ] 

Kan Zhang commented on SPARK-1118:
----------------------------------

I took a look at running SparkPi on my single-node cluster (laptop). There 
seem to be two issues.

1. All the work was done in the first executor. When the job is done, the 
driver asks the executor to shut down. However, this clean exit was assigned 
the FAILED executor state by the Worker. I introduced an EXITED executor state 
for executors that exit voluntarily (covering both normal and abnormal exits, 
distinguished by the exit code).
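The mapping described above can be sketched as follows. This is a minimal, hypothetical Python illustration of the proposed logic, not the actual PR code (which lives in Spark's Scala Worker/ExecutorRunner); the helper name and return shape are my own:

```python
# Hypothetical sketch of the proposed state assignment (not the actual PR code).
# A voluntary exit is reported as EXITED rather than unconditionally FAILED;
# the exit code distinguishes a normal shutdown (0) from an abnormal one.

def state_for_exit(exit_code):
    """Map an executor's exit code to (reported state, exited normally?)."""
    return ("EXITED", exit_code == 0)

# A clean shutdown requested by the driver exits with code 0.
state, normal = state_for_exit(0)
```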

2. When the Master is notified that the first executor exited, it launches a 
second one, which is not needed and is subsequently killed when the App 
disassociates. We could change the scheduler to tell the Master the job is 
done so that the Master wouldn't start the second executor. However, there is 
a race between the App telling the Master the job is done and the Worker 
telling the Master the first executor exited; there is no guarantee the former 
happens before the latter. Instead, I chose to check the exit code when an 
executor exits: if the exit code is 0, I assume the executor was asked to 
shut down by the driver, and the Master will not schedule new executors. This 
avoids launching the second executor, so no executor is killed in the Worker's 
log. However, it is still possible (although it didn't happen on my local 
cluster) for the first executor to be killed by the Master, if the Master 
detects the App disassociation event before the first executor exits. The 
order of these events can't be guaranteed since they arrive via different 
paths. If an executor does get killed, I favor leaving its state as KILLED, 
even though the App state may be FINISHED.
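The Master-side decision above can be sketched like this. Again a hypothetical Python illustration under a simplified view of the Master's executor-exit handling; the function name and signature are assumptions, not Spark's API:

```python
# Hypothetical sketch of the Master's relaunch decision (not the actual PR code).
# Exit code 0 is taken to mean the driver asked the executor to shut down
# (the app is done), so no replacement executor is scheduled; a nonzero
# code indicates a failure that warrants relaunching.

def should_relaunch(exit_code):
    """Decide whether the Master should schedule a replacement executor."""
    return exit_code != 0
```

Keying the decision on the exit code sidesteps the race: the Worker's exit notification carries everything needed, so the Master never has to order it against the App's "job done" message.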

Here's the PR. Please let me know what else I can do.

https://github.com/apache/spark/pull/306


> Executor state shows as KILLED even when the application finishes normally
> ------------------------------------------------------------------------
>
>                 Key: SPARK-1118
>                 URL: https://issues.apache.org/jira/browse/SPARK-1118
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Nan Zhu
>             Fix For: 1.0.0
>
>
> This seems weird: ExecutorState has no FINISHED option, so a terminated 
> executor can only be KILLED, FAILED, or LOST



--
This message was sent by Atlassian JIRA
(v6.2#6252)