Hi Chackra,

You are absolutely right. The workaround I was planning to work on should be implemented as a clean backup solution for when YARN fails to shut down containers (in this and certain other possible scenarios).
In fact, we had filed a bug a long time back along the same lines, predicting this issue (for another scenario) - https://issues.apache.org/jira/browse/SLIDER-479

Since you had expressed interest in contributing to Slider, I was wondering if you would have some cycles and be willing to take this up. You can work on the develop branch and use SLIDER-479. The Slider develop branch is compatible with HDP 2.2, so we can easily test the fix in your cluster. Let me know, and I can help all along the way.

In case you have some cycles, here are some pointers that might help you approach this problem -

1. Slider has a notion of sending a terminate command to the agent, which the agent obeys and gracefully brings itself down.
2. In this scenario, since the Slider AM goes down, the agents can look for a node in Zookeeper (when they lose the connection with the AM) and shut themselves down if the node is missing (using the terminate code path or something more elegant).
3. Of course, this Zookeeper node needs to be created by the Slider AM at the beginning of cluster creation and then deleted just before the AM shuts down as part of the stop command (we might have to look into the YARN pre-emption scenario, but we can ignore this for now). We do not want to delete this node in the AM failure/restart scenario.
4. Any other better ideas or more elegant solutions you can think of.

On a side note, can you test this on Debian 7 - go to one of the nodes where any of the agents is running (say NIMBUS or any other component) and issue a SIGTERM to the main.py process (kill -s TERM <pid>). What do you see in slider-agent.log after that? What happens to all the processes in this container? Are they still running?

The <pid> is that of the bash main.py process (not the python main.py child process). So if the process is something like this -

yarn 6007 6003 0 19:43 ? 00:00:00 /bin/bash -c python ./infra/agent/slider-agent/agent/main.py --label container_1431413628146_0003_01_000002___NIMBUS --zk-quorum c6408.ambari.apache.org:2181 --zk-reg-path /registry/users/yarn/services/org-apache-slider/storm_1 > /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out 2>&1

you need to issue -

kill -s TERM 6007

-Gour

On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <[email protected]<mailto:[email protected]>> wrote:

Thanks for your response, Steve. I was thinking that the Slider agent would receive a 'stop' command from the Slider AM to kill the components spawned by those agents. And yes, this might be specific to the Debian installation, as others in the group are not facing this issue.

On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <[email protected]<mailto:[email protected]>> wrote:

> On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu <[email protected]<mailto:[email protected]>> wrote:
>
> Starting a new thread,
>
> A JIRA has already been filed for the same by Gour:
> https://issues.apache.org/jira/browse/YARN-3561
>
> Slider stop does not stop the components started by Slider; instead it
> stops only the Slider AM, and even the Slider agents did not receive a
> 'stop' command. (This happens with Debian 7.) Tested with 0.70.1 as well
> as the 'develop' branch code.
>
> Today I just came across the following mail archive:
>
> http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%[email protected]%3E
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
> *What is not implemented is an explicit call to the "stop function in the
> python scripts".
>
> What I was referring to is that an attempt is made by the Agent to call
> stop in the python script, but it is not guaranteed. The reason it is not
> guaranteed is that the call to stop() and the kill of the containers by
> YARN are not co-ordinated.
>
> In summary, the ability to call the stop() functions in the python script
> is not implemented.
> It's in the plan though.*
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> Does this still exist?

The idea of the stop() command is to actually offer a best-effort clean shutdown for containers. Currently the AM just directly tells YARN to destroy a container. The agent doesn't get told, nor does the application (that's implicit from the agent). YARN is expected to "kill" and then, if there is no response, "kill -9" the agent process. Which it does for the hosts we test on: Linux, OS X and Windows. If something is up with your YARN+Debian installation, we believe it is related to whether those container kill events are coming out from the node manager.
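The Zookeeper-based shutdown protocol in pointers 1-3 earlier in the thread could be sketched roughly as below. This is only an illustration of the idea, not actual Slider code: the node path, the class and function names, and the FakeZK client are all hypothetical. FakeZK stands in for a real Zookeeper client (e.g. kazoo's KazooClient, which exposes similar exists/create/delete calls).

```python
# Hypothetical sketch of the ZK-based shutdown handshake. All names here
# (LIVE_NODE, SliderAMNodeLifecycle, agent_should_terminate, FakeZK) are
# illustrative assumptions, not real Slider APIs.

LIVE_NODE = "/slider/clusters/{cluster}/am-live"


class FakeZK:
    """Minimal in-memory stand-in for a Zookeeper client (e.g. kazoo)."""

    def __init__(self):
        self.nodes = set()

    def exists(self, path):
        return path in self.nodes

    def create(self, path, makepath=False):
        self.nodes.add(path)

    def delete(self, path):
        self.nodes.discard(path)


class SliderAMNodeLifecycle:
    """AM side: create a persistent liveness node at cluster create, and
    delete it only on a clean 'stop' -- never on AM failure/restart, so a
    restarting AM finds the node still present."""

    def __init__(self, zk, cluster):
        self.zk = zk
        self.path = LIVE_NODE.format(cluster=cluster)

    def on_create_cluster(self):
        # Persistent (not ephemeral): the node must survive AM restarts.
        if not self.zk.exists(self.path):
            self.zk.create(self.path, makepath=True)

    def on_stop_command(self):
        # Deleted just before the AM shuts down as part of 'stop'.
        if self.zk.exists(self.path):
            self.zk.delete(self.path)


def agent_should_terminate(zk, cluster):
    """Agent side: after losing its connection to the AM, check the
    liveness node; a missing node means the cluster was stopped, so the
    agent should take its graceful-terminate code path."""
    return not zk.exists(LIVE_NODE.format(cluster=cluster))


# Example lifecycle using the in-memory fake:
zk = FakeZK()
am = SliderAMNodeLifecycle(zk, "storm_1")
am.on_create_cluster()                              # at cluster create
keep_running = not agent_should_terminate(zk, "storm_1")  # AM alive
am.on_stop_command()                                # just before AM exits
must_stop = agent_should_terminate(zk, "storm_1")   # node gone: shut down
```

An agent that loses its AM connection but finds the node present would keep retrying (covering the AM failure/restart case), while a missing node triggers the existing terminate code path.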

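The SIGTERM experiment suggested above can be reproduced in miniature without a Slider cluster. The sketch below is a stand-in, not the real container: 'sleep 300; true' plays the role of the python main.py child behind the /bin/bash -c wrapper. It TERMs the wrapper (the analogue of kill -s TERM 6007) and then checks whether the child survives; on Linux, an untrapped SIGTERM kills the non-interactive bash wrapper but is not forwarded to its running foreground child, which is exactly the orphaned-process behavior the question is probing.

```python
import os
import signal
import subprocess
import time

# Launch a wrapper shell running a long-lived child, mimicking the
# container command:
#   /bin/bash -c 'python .../main.py ... > slider-agent.out 2>&1'
# The trailing 'true' forces bash to fork 'sleep' as a real child
# instead of exec'ing it.
wrapper = subprocess.Popen(["/bin/bash", "-c", "sleep 300; true"])
time.sleep(0.5)

# Find the child of the wrapper (the analogue of the python main.py
# child process).
child_pid = int(subprocess.check_output(
    ["pgrep", "-P", str(wrapper.pid)]).split()[0])

# TERM the bash wrapper, as in: kill -s TERM <pid-of-bash-main.py>
os.kill(wrapper.pid, signal.SIGTERM)
wrapper.wait()
time.sleep(0.5)

# Did the orphaned child survive its parent's termination?
try:
    os.kill(child_pid, 0)   # signal 0: existence check only
    child_alive = True
except ProcessLookupError:
    child_alive = False

print("child still running after TERM to wrapper:", child_alive)

# Clean up the orphan so the experiment leaves nothing behind.
if child_alive:
    os.kill(child_pid, signal.SIGKILL)
```

If the same survival shows up for the agent's child processes on the Debian 7 nodes, that would support the theory that the containers' processes are being left behind rather than killed.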