Min Zhou created SPARK-4069:
-------------------------------

             Summary: [SPARK-YARN] ApplicationMaster should release all 
executors' containers before unregistering itself from Yarn RM
                 Key: SPARK-4069
                 URL: https://issues.apache.org/jira/browse/SPARK-4069
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.1.0
            Reporter: Min Zhou


Currently, the ApplicationMaster in YARN mode simply unregisters itself from the 
YARN master, a.k.a. the ResourceManager. It never releases the executors' 
containers before that. In that situation the ResourceManager decides to kill 
all of the executors' containers itself, so the ResourceManager log looks like 
this:

{noformat}
2014-10-22 23:39:09,903 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Processing event for appattempt_1414003182949_0004_000001 of type UNREGISTERED
2014-10-22 23:39:09,903 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1414003182949_0004_000001 State change from RUNNING to FINAL_SAVING
2014-10-22 23:39:09,903 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
application application_1414003182949_0004 with final state: FINISHING
2014-10-22 23:39:09,903 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1414003182949_0004 State change from RUNNING to FINAL_SAVING
2014-10-22 23:39:09,903 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Processing event for appattempt_1414003182949_0004_000001 of type 
ATTEMPT_UPDATE_SAVED
2014-10-22 23:39:09,903 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
info for app: application_1414003182949_0004
2014-10-22 23:39:09,903 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1414003182949_0004_000001 State change from FINAL_SAVING to FINISHING
2014-10-22 23:39:09,903 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1414003182949_0004 State change from FINAL_SAVING to FINISHING
2014-10-22 23:39:10,485 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Processing event for appattempt_1414003182949_0004_000001 of type 
CONTAINER_FINISHED
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1414003182949_0004_01_000001 Container Transitioned from RUNNING to 
COMPLETED
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
Unregistering app attempt : appattempt_1414003182949_0004_000001
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
Completed container: container_1414003182949_0004_01_000001 in state: COMPLETED 
event:FINISHED
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
 Finish information of container container_1414003182949_0004_01_000001 is 
written
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1414003182949_0004_000001 State change from FINISHING to FINISHED
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim     
OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
APPID=application_1414003182949_0004    
CONTAINERID=container_1414003182949_0004_01_000001
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Stored the finish data of container container_1414003182949_0004_01_000001
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
Released container container_1414003182949_0004_01_000001 of capacity 
<memory:3072, vCores:1> on host host1, which currently has 0 containers, 
<memory:0, vCores:0> used and <memory:241901, vCores:32> available, release 
resources=true
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1414003182949_0004 State change from FINISHING to FINISHED
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
 Finish information of application attempt appattempt_1414003182949_0004_000001 
is written
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim     
OPERATION=Application Finished - Succeeded      TARGET=RMAppManager     
RESULT=SUCCESS  APPID=application_1414003182949_0004
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application attempt appattempt_1414003182949_0004_000001 released container 
container_1414003182949_0004_01_000001 on node: host: host2:8041 #containers=0 
available=<memory:241901, vCores:32> used=<memory:0, vCores:0> with event: 
FINISHED
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Stored the finish data of application attempt 
appattempt_1414003182949_0004_000001
2014-10-22 23:39:10,485 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application appattempt_1414003182949_0004_000001 is done. finalState=FINISHED
2014-10-22 23:39:10,486 INFO 
org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
 Finish information of application application_1414003182949_0004 is written
2014-10-22 23:39:10,486 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1414003182949_0004_01_000019 Container Transitioned from RUNNING to 
KILLED
{noformat}
Although this does not affect the job's final success status, the log will 
confuse users.
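
To check whether a given ResourceManager log shows this pattern, something like the following sketch could be used. It is only an illustration: the function name `killed_after_finish` is made up for this example, the regular expressions are based solely on the log lines quoted above, and the mapping from container ID to application ID assumes the `container_<cluster>_<app>_<attempt>_<n>` layout seen in this log.

```python
import re

# Container state transitions as printed by RMContainerImpl in the log above.
CONTAINER_RE = re.compile(
    r"(container_\d+_\d+_\d+_\d+) Container Transitioned from (\w+) to (\w+)"
)
# The application's final transition as printed by RMAppImpl in the log above.
APP_FINISHED_RE = re.compile(
    r"(application_\d+_\d+) State change from FINISHING to FINISHED"
)


def killed_after_finish(log_text):
    """Return IDs of containers moved to KILLED after their app reached FINISHED."""
    # Long log records may be wrapped across lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    finished_at = {}  # application ID -> offset of its first FINISHED record
    for m in APP_FINISHED_RE.finditer(flat):
        finished_at.setdefault(m.group(1), m.start())
    late = []
    for m in CONTAINER_RE.finditer(flat):
        container_id, new_state = m.group(1), m.group(3)
        # Assumed layout: container_<cluster>_<app>_<attempt>_<n>
        parts = container_id.split("_")
        app_id = "application_{}_{}".format(parts[1], parts[2])
        if (new_state == "KILLED" and app_id in finished_at
                and m.start() > finished_at[app_id]):
            late.append(container_id)
    return late


# Two records taken from the log quoted in this ticket.
sample = """
2014-10-22 23:39:10,485 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1414003182949_0004 State change from FINISHING to FINISHED
2014-10-22 23:39:10,486 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1414003182949_0004_01_000019 Container Transitioned from RUNNING to
KILLED
"""
print(killed_after_finish(sample))
```

The script relies on the records appearing in chronological order, which holds for the log excerpt above.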

If we run a Spark job on YARN 2.4.1 with the timeline server enabled, we get 
errors in the ResourceManager's log:
{noformat}
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000019
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000017
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000009
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000010
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000012
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000003
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000005
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000004
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000015
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000018
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000013
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000008
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000014
2014-10-22 23:39:10,637 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000007
2014-10-22 23:39:10,638 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
Error when storing the finish data of container 
container_1414003182949_0004_01_000002
{noformat}
This is because the application finishes before its containers are terminated. 
Once the executors' containers are killed, the ResourceManager tries to record 
each container's finish event, but cannot find a history writer because the 
application has already finished:

{noformat}
java.io.IOException: History file of application application_1414003182949_0003 is not opened
    org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileWriter(FileSystemApplicationHistoryStore.java:643)
    org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.containerFinished(FileSystemApplicationHistoryStore.java:532)
    org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:203)
    org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
    org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
    org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    java.lang.Thread.run(Thread.java:745)

{noformat}
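
The ordering problem described above can be illustrated with a small toy model. This is not YARN or Spark code: `ToyResourceManager`, `shutdown` and the log strings are all invented for this sketch, and it assumes that a release is handled immediately, whereas a real ApplicationMaster would probably have to wait for the release to be acknowledged. In Hadoop 2.x the corresponding client calls would presumably be `AMRMClient.releaseAssignedContainer` followed by `AMRMClient.unregisterApplicationMaster`.

```python
class ToyResourceManager:
    """Toy stand-in for the ResourceManager; not the real YARN API."""

    def __init__(self, containers):
        self.running = set(containers)
        self.history_open = True  # history writer exists while the app is live
        self.log = []

    def release(self, container_id):
        # AM-initiated release: the container finishes while the app is live.
        self.running.discard(container_id)
        self._store_finish_data(container_id, "COMPLETED")

    def unregister(self):
        # The application finishes and its history writer is closed.
        self.history_open = False
        self.log.append("application FINISHED")
        # Containers that are still running are killed only afterwards.
        for container_id in sorted(self.running):
            self._store_finish_data(container_id, "KILLED")
        self.running.clear()

    def _store_finish_data(self, container_id, state):
        if self.history_open:
            self.log.append(
                "{} {}: finish data stored".format(container_id, state))
        else:
            self.log.append(
                "{} {}: ERROR history file is not opened".format(container_id, state))


def shutdown(rm, executors, release_first):
    """Shut the toy application down, optionally releasing executors first."""
    if release_first:
        for container_id in executors:
            rm.release(container_id)
    rm.unregister()
    return rm.log


executors = ["c2", "c3"]
# Current behaviour: unregister only, so the RM kills the executors afterwards.
print(shutdown(ToyResourceManager(executors), executors, release_first=False))
# Proposed behaviour: release every executor container, then unregister.
print(shutdown(ToyResourceManager(executors), executors, release_first=True))
```

In the first run every executor container hits the "history file is not opened" path, which mirrors the errors above. In the second run the finish data is stored before the application finishes.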



