[ https://issues.apache.org/jira/browse/SLIDER-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
kyungwan nam updated SLIDER-1253:
---------------------------------
    Attachment: SLIDER-1253.patch

I have attached a patch. Reviews and comments are welcome.

> All containers are killed when decommission a NM which the AM is placed.
> ------------------------------------------------------------------------
>
>                 Key: SLIDER-1253
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1253
>             Project: Slider
>          Issue Type: Bug
>    Affects Versions: Slider 0.92
>            Reporter: kyungwan nam
>         Attachments: SLIDER-1253.patch
>
>
> Once a NodeManager is decommissioned, the RM immediately releases the containers running on it, and a new app attempt is launched if the released container is the AM.
> RM log:
> {code}
> 2017-11-30 09:11:31,351 INFO rmnode.RMNodeImpl (RMNodeImpl.java:transition(734)) - Deactivating Node host1:45454 as it is now DECOMMISSIONED
> 2017-11-30 09:11:31,351 INFO rmnode.RMNodeImpl (RMNodeImpl.java:handle(424)) - host1:45454 Node Transitioned from RUNNING to DECOMMISSIONED
> 2017-11-30 09:11:31,352 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(384)) - container_e12_1487083747959_0214_01_000001 Container Transitioned from RUNNING to KILLED
> 2017-11-30 09:11:31,352 ERROR ahs.RMApplicationHistoryWriter (RMApplicationHistoryWriter.java:handleWritingApplicationHistoryEvent(214)) - Error when storing the finish data of container container_e12_1487083747959_0214_01_000001
> 2017-11-30 09:11:31,352 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:containerCompleted(123)) - Completed container: container_e12_1487083747959_0214_01_000001 in state: KILLED event:KILL
> 2017-11-30 09:11:31,352 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=user1 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1487083747959_0214 CONTAINERID=container_e12_1487083747959_0214_01_000001
> 2017-11-30 09:11:31,352 INFO scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(217)) - Released container container_e12_1487083747959_0214_01_000001 of capacity <memory:1024, vCores:1> on host host1:45454, which currently has 0 containers, <memory:0, vCores:0> used and <memory:120000, vCores:24> available, release resources=true
> 2017-11-30 09:11:31,352 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1169)) - Updating application attempt appattempt_1487083747959_0214_000001 with final state: FAILED, and exit status: -100
> ...
> 2017-11-30 09:11:31,354 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1487083747959_0214 State change from RUNNING to ACCEPTED
> 2017-11-30 09:11:31,354 INFO capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(818)) - Application Attempt appattempt_1487083747959_0214_000001 is done. finalState=FAILED
> 2017-11-30 09:11:31,354 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(675)) - Registering app attempt : appattempt_1487083747959_0214_000002
> {code}
> At this point the old AM process has not actually exited yet, so it can hit an ApplicationAttemptNotFoundException because its app attempt no longer matches the one registered with the RM.
> As a result, the AMRMClientAsync.onShutdownRequest callback is called.
> AM log for container_e12_1487083747959_0214_01_000001:
> {code}
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Shutdown Request received
> 17/11/30 09:11:32 INFO impl.AMRMClientAsyncImpl: Shutdown requested. Stopping callback.
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: SliderAppMasterApi.stopCluster: Shutdown requested from RM
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Triggering shutdown of the AM: stop: exit code = 0, SUCCEEDED: Shutdown requested from RM;
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Process has exited with exit code 0 mapped to 0 -ignoring
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Setting stopInitiated flag to true
> 17/11/30 09:11:32 INFO appmaster.SliderAppMaster: Container release timeout in millis = 0
> 17/11/30 09:11:32 INFO state.AppState: Releasing 11 containers
> {code}
> Currently, the onShutdownRequest callback stops the entire application:
> {code}
>   public void onShutdownRequest() {
>     LOG_YARN.info("Shutdown Request received");
>     ActionStopSlider stopSlider = new ActionStopSlider("stop", EXIT_SUCCESS,
>         FinalApplicationStatus.SUCCEEDED, "Shutdown requested from RM");
>     stopSlider.setExitReason(SliderExitReason.YARN_ERROR);
>     signalAMComplete(stopSlider);
>   }
> {code}
> I think this callback should stop only the AM rather than the entire application.
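To illustrate the suggestion of stopping only the AM, here is a minimal, self-contained Java sketch. It is not the attached patch and does not use any Slider or YARN classes: FakeAppMaster, stopApplication and stopAmOnly are hypothetical names made up for this example, and the container count of 11 is simply taken from the AM log above. It only contrasts the two shutdown paths, one that releases every container before the AM stops and one that stops the AM and leaves the containers alone.

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownSketch {

    /** Hypothetical stand-in for the AM's view of its running containers. */
    static final class FakeAppMaster {
        final List<String> liveContainers = new ArrayList<>();
        boolean amStopped = false;

        FakeAppMaster(int containerCount) {
            // Container 1 is the AM itself, so workers start at 2.
            for (int i = 2; i < 2 + containerCount; i++) {
                liveContainers.add(String.format("container_%06d", i));
            }
        }

        /** Behavior described in the report: release everything, then stop. */
        void stopApplication() {
            liveContainers.clear();
            amStopped = true;
        }

        /** Suggested behavior (assumption): stop the AM, keep the containers. */
        void stopAmOnly() {
            amStopped = true;
        }
    }

    public static void main(String[] args) {
        FakeAppMaster current = new FakeAppMaster(11);
        current.stopApplication();

        FakeAppMaster proposed = new FakeAppMaster(11);
        proposed.stopAmOnly();

        System.out.println("stop application: " + current.liveContainers.size() + " containers left");
        System.out.println("stop AM only:     " + proposed.liveContainers.size() + " containers left");
    }
}
```

In a real cluster the second path only helps if the containers can outlive the AM attempt, which I assume requires the application to have been submitted with keep-containers-across-application-attempts enabled so that the new attempt can pick them up.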