[ 
https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250762#comment-14250762
 ] 

David McWhorter commented on SPARK-4069:
----------------------------------------

I am seeing the same behavior: a Spark application fails and its FinalStatus is 
set to FAILED, but its State hangs in FINISHING.  The application does not 
release its resources.  It continues processing for some time in this state, 
then eventually finishes; the State transitions to FINISHED and the resources 
are released.  In the meantime there is no way to kill the application and 
force it to release its executors.
Using Hadoop 2.2.0 and Spark 1.0.1.

> [SPARK-YARN] ApplicationMaster should release all executors' containers 
> before unregistering itself from Yarn RM
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4069
>                 URL: https://issues.apache.org/jira/browse/SPARK-4069
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.1.0
>            Reporter: Min Zhou
>
> Currently, the ApplicationMaster in YARN mode simply unregisters itself from 
> the YARN master, a.k.a. the ResourceManager.  It never releases the 
> executors' containers before that.  In this scenario the ResourceManager 
> decides to kill all of the executors' containers itself, so the 
> ResourceManager log looks like the one below:
> {noformat}
> 2014-10-22 23:39:09,903 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1414003182949_0004_000001 of type UNREGISTERED
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1414003182949_0004_000001 State change from RUNNING to FINAL_SAVING
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
> application application_1414003182949_0004 with final state: FINISHING
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1414003182949_0004 State change from RUNNING to FINAL_SAVING
> 2014-10-22 23:39:09,903 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1414003182949_0004_000001 of type 
> ATTEMPT_UPDATE_SAVED
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
> info for app: application_1414003182949_0004
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1414003182949_0004_000001 State change from FINAL_SAVING to 
> FINISHING
> 2014-10-22 23:39:09,903 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1414003182949_0004 State change from FINAL_SAVING to FINISHING
> 2014-10-22 23:39:10,485 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1414003182949_0004_000001 of type 
> CONTAINER_FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1414003182949_0004_01_000001 Container Transitioned from RUNNING to 
> COMPLETED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Unregistering app attempt : appattempt_1414003182949_0004_000001
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1414003182949_0004_01_000001 in state: 
> COMPLETED event:FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
>  Finish information of container container_1414003182949_0004_01_000001 is 
> written
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1414003182949_0004_000001 State change from FINISHING to FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
> OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1414003182949_0004    
> CONTAINERID=container_1414003182949_0004_01_000001
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Stored the finish data of container container_1414003182949_0004_01_000001
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Released container container_1414003182949_0004_01_000001 of capacity 
> <memory:3072, vCores:1> on host host1, which currently has 0 containers, 
> <memory:0, vCores:0> used and <memory:241901, vCores:32> available, release 
> resources=true
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1414003182949_0004 State change from FINISHING to FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
>  Finish information of application attempt 
> appattempt_1414003182949_0004_000001 is written
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
> OPERATION=Application Finished - Succeeded      TARGET=RMAppManager     
> RESULT=SUCCESS  APPID=application_1414003182949_0004
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Application attempt appattempt_1414003182949_0004_000001 released container 
> container_1414003182949_0004_01_000001 on node: host: host2:8041 
> #containers=0 available=<memory:241901, vCores:32> used=<memory:0, vCores:0> 
> with event: FINISHED
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Stored the finish data of application attempt 
> appattempt_1414003182949_0004_000001
> 2014-10-22 23:39:10,485 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Application appattempt_1414003182949_0004_000001 is done. finalState=FINISHED
> 2014-10-22 23:39:10,486 INFO 
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
>  Finish information of application application_1414003182949_0004 is written
> 2014-10-22 23:39:10,486 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1414003182949_0004_01_000019 Container Transitioned from RUNNING to 
> KILLED
> {noformat}
> Although this does not affect the job's final success status, the log will 
> confuse users. 
> If we run a Spark job on YARN 2.4.1 with the timeline server enabled, we 
> get errors in the ResourceManager's log:
> {noformat}
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000019
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000017
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000009
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000010
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000012
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000003
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000005
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000004
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000015
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000018
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000013
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000008
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000014
> 2014-10-22 23:39:10,637 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000007
> 2014-10-22 23:39:10,638 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
> Error when storing the finish data of container 
> container_1414003182949_0004_01_000002
> {noformat}
> This is because the application finishes before its containers are 
> terminated.  Once the executors' containers are killed, the ResourceManager 
> tries to log the containers' finish events, but cannot find a writer because 
> the application has already finished:
> {noformat}
> java.io.IOException: History file of application 
> application_1414003182949_0003 is not opened
>     
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileWriter(FileSystemApplicationHistoryStore.java:643)
>     
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.containerFinished(FileSystemApplicationHistoryStore.java:532)
>     
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:203)
>     
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
>     
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
>     
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     java.lang.Thread.run(Thread.java:745)
> {noformat}
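
The ordering problem described above can be illustrated with a small model. 
This is only a sketch, not Spark's or YARN's actual code: FakeResourceManager, 
shutdown, late_container_events and the container and application ids are all 
hypothetical names made up for this example. It assumes, as the report 
describes, that the ResourceManager marks the application finished first and 
only then kills whatever containers are still running. In the real YARN client 
API the corresponding calls would presumably be 
AMRMClient.releaseAssignedContainer and AMRMClient.unregisterApplicationMaster.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResourceManager:
    """Hypothetical stand-in for the YARN ResourceManager; records events in order."""
    running: set = field(default_factory=set)
    events: list = field(default_factory=list)

    def release(self, container_id):
        # The AM hands the container back while the application is still live.
        self.running.discard(container_id)
        self.events.append(("RELEASED", container_id))

    def unregister(self, app_id):
        # Assumption from the report: the application is marked finished first ...
        self.events.append(("APP_FINISHED", app_id))
        # ... and only then are the containers that are still running killed.
        for container_id in sorted(self.running):
            self.events.append(("KILLED", container_id))
        self.running.clear()


def shutdown(rm, app_id, release_first):
    """Unregister the application, optionally releasing its executors first."""
    if release_first:
        # sorted() copies the ids, so releasing while looping is safe.
        for container_id in sorted(rm.running):
            rm.release(container_id)
    rm.unregister(app_id)
    return rm.events


def late_container_events(events):
    """Return the container events recorded after the application finished."""
    finished_at = next(
        i for i, (kind, _) in enumerate(events) if kind == "APP_FINISHED"
    )
    return events[finished_at + 1:]


executors = {"container_02", "container_19"}  # made-up ids
current = shutdown(
    FakeResourceManager(running=set(executors)), "app_0004", release_first=False
)
proposed = shutdown(
    FakeResourceManager(running=set(executors)), "app_0004", release_first=True
)

print(late_container_events(current))   # KILLED events arrive after APP_FINISHED
print(late_container_events(proposed))  # nothing arrives after APP_FINISHED
```

In the first run the KILLED events land after the application has finished, 
which is the situation in which the history writer can no longer be found. In 
the second run every container event precedes the application's finish event.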


