[jira] [Updated] (SLIDER-1253) All containers are killed when decommission a NM which the AM is placed.

kyungwan nam (JIRA) Sun, 03 Dec 2017 23:08:02 -0800

     [ 
https://issues.apache.org/jira/browse/SLIDER-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


kyungwan nam updated SLIDER-1253:
---------------------------------
    Attachment: SLIDER-1253.patch

Attached a patch.
Reviews and comments are welcome.

> All containers are killed when decommission a NM which the AM is placed.
> ------------------------------------------------------------------------
>
>                 Key: SLIDER-1253
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1253
>             Project: Slider
>          Issue Type: Bug
>    Affects Versions: Slider 0.92
>            Reporter: kyungwan nam
>         Attachments: SLIDER-1253.patch
>
>
> Once a NodeManager is decommissioned, the RM immediately releases the 
> containers running on it. If one of the released containers is the AM, a new 
> application attempt is launched.
> RM log:
> {code}
> 2017-11-30 09:11:31,351 INFO  rmnode.RMNodeImpl 
> (RMNodeImpl.java:transition(734)) - Deactivating Node host1:45454 as it is 
> now DECOMMISSIONED
> 2017-11-30 09:11:31,351 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:handle(424)) 
> - host1:45454 Node Transitioned from RUNNING to DECOMMISSIONED
> 2017-11-30 09:11:31,352 INFO  rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(384)) - 
> container_e12_1487083747959_0214_01_000001 Container Transitioned from 
> RUNNING to KILLED
> 2017-11-30 09:11:31,352 ERROR ahs.RMApplicationHistoryWriter 
> (RMApplicationHistoryWriter.java:handleWritingApplicationHistoryEvent(214)) - 
> Error when storing the finish data of container 
> container_e12_1487083747959_0214_01_000001
> 2017-11-30 09:11:31,352 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:containerCompleted(123)) - Completed container: 
> container_e12_1487083747959_0214_01_000001 in state: KILLED event:KILL
> 2017-11-30 09:11:31,352 INFO  resourcemanager.RMAuditLogger 
> (RMAuditLogger.java:logSuccess(106)) - USER=user1    OPERATION=AM Released 
> Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1487083747959_0214    
> CONTAINERID=container_e12_1487083747959_0214_01_000001
> 2017-11-30 09:11:31,352 INFO  scheduler.SchedulerNode 
> (SchedulerNode.java:releaseContainer(217)) - Released container 
> container_e12_1487083747959_0214_01_000001 of capacity <memory:1024, 
> vCores:1> on host host1:45454, which currently has 0 containers, <memory:0, 
> vCores:0> used and <memory:120000, vCores:24> available, release 
> resources=true
> 2017-11-30 09:11:31,352 INFO  attempt.RMAppAttemptImpl 
> (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1169)) - 
> Updating application attempt appattempt_1487083747959_0214_000001 with final 
> state: FAILED, and exit status: -100
> ...
> 2017-11-30 09:11:31,354 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - 
> application_1487083747959_0214 State change from RUNNING to ACCEPTED
> 2017-11-30 09:11:31,354 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:doneApplicationAttempt(818)) - Application Attempt 
> appattempt_1487083747959_0214_000001 is done. finalState=FAILED
> 2017-11-30 09:11:31,354 INFO  resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:registerAppAttempt(675)) - Registering app 
> attempt : appattempt_1487083747959_0214_000002
> {code}
> At this point the original AM process has not actually been stopped yet. Its 
> application attempt no longer matches the one registered in the RM, so an 
> ApplicationAttemptNotFoundException can occur.
> As a result, the AMRMClientAsync.onShutdownRequest callback is called.
> AM log for container_e12_1487083747959_0214_01_000001:
> {code}
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Shutdown Request received
> 17/11/30 09:11:32 INFO impl.AMRMClientAsyncImpl: Shutdown requested. Stopping 
> callback.
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: 
> SliderAppMasterApi.stopCluster: Shutdown requested from RM
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Triggering shutdown of the 
> AM: stop:  exit code = 0, SUCCEEDED: Shutdown requested from RM;
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Process has exited with 
> exit code 0 mapped to 0 -ignoring
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Setting stopInitiated flag 
> to true
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Container release timeout 
> in millis = 0
> 17/11/30 09:11:32 INFO state.AppState: Releasing 11 containers
> {code}
> Currently, the onShutdownRequest callback stops the entire application:
> {code}
>    public void onShutdownRequest() {
>      LOG_YARN.info("Shutdown Request received");
>      ActionStopSlider stopSlider = new ActionStopSlider("stop", EXIT_SUCCESS,
>          FinalApplicationStatus.SUCCEEDED, "Shutdown requested from RM");
>      stopSlider.setExitReason(SliderExitReason.YARN_ERROR);
>      signalAMComplete(stopSlider);
>    }
> {code}
> I think this callback should stop only the AM instead of stopping the entire 
> application.
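
To make the difference between the two behaviours concrete, here is a minimal, self-contained sketch. It is not Slider code and does not reflect the contents of the attached patch. ShutdownRequestSketch, FakeAppMaster, StopMode and the container names are hypothetical stand-ins invented for illustration. STOP_APPLICATION models the current behaviour described above, where every tracked container is released. STOP_AM_ONLY models the suggested behaviour, where only the AM stops and the containers are left untouched. Whether a later application attempt can actually pick those containers up again depends on YARN and application settings that this sketch does not model.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model (not Slider code) of two reactions to an RM shutdown request. */
class ShutdownRequestSketch {

    /** How the AM treats its worker containers when it is asked to shut down. */
    enum StopMode { STOP_APPLICATION, STOP_AM_ONLY }

    /** Minimal stand-in for an application master that tracks its containers. */
    static final class FakeAppMaster {
        private final List<String> liveContainers = new ArrayList<>();
        private final List<String> releasedContainers = new ArrayList<>();
        private boolean stopped;

        FakeAppMaster(int containerCount) {
            // Made-up names; index 1 is left out because it stands for the AM itself.
            for (int i = 2; i < containerCount + 2; i++) {
                liveContainers.add(String.format("container_%06d", i));
            }
        }

        /** Reacts to a shutdown request according to the chosen mode. */
        void onShutdownRequest(StopMode mode) {
            if (mode == StopMode.STOP_APPLICATION) {
                // Current behaviour in the report: release everything, then stop.
                releasedContainers.addAll(liveContainers);
                liveContainers.clear();
            }
            // In STOP_AM_ONLY mode the containers are deliberately left alone.
            stopped = true;
        }

        boolean isStopped() { return stopped; }
        int liveCount() { return liveContainers.size(); }
        int releasedCount() { return releasedContainers.size(); }
    }

    public static void main(String[] args) {
        for (StopMode mode : StopMode.values()) {
            FakeAppMaster am = new FakeAppMaster(11);
            am.onShutdownRequest(mode);
            System.out.println(mode + ": stopped=" + am.isStopped()
                + ", live=" + am.liveCount()
                + ", released=" + am.releasedCount());
        }
    }
}
```

Running main prints one summary line per mode, so the container counts for the two modes can be compared side by side.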



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
