Sure Gour, I would like to take up this task and contribute. Thanks for the pointers to proceed with; I will get in touch with you in case I need any more help.

And w.r.t. kill -s TERM on the main.py processes (tried on both the parent and the child process independently), please find the results below. In none of the cases was the application killed.

*1) Slider app created, and it is running (not stopped)*

*1.1) kill 'bash main.py' process*
- it killed both the 'bash main.py' process and its child 'main.py' process
- but the application process (nimbus) is still running

SliderAgent.log :
INFO 2015-05-14 12:17:06,591 Controller.py:497 - Component states (result): Expected: 4 and Actual: 5
ERROR 2015-05-14 12:17:06,596 Controller.py:289 - Got terminateAgent command
INFO 2015-05-14 12:17:16,597 Controller.py:217 - Terminate agent command received from AM, stopping the agent ...
INFO 2015-05-14 12:17:16,597 ProcessHelper.py:39 - Removing pid file
WARNING 2015-05-14 12:17:16,598 ProcessHelper.py:44 - Unable to remove pid file: [Errno 2] No such file or directory: '/grid/4/hadoop/yarn/local/usercache/yarn/appcache/application_1431424102217_0003/container_1431424102217_0003_01_000002/infra/run/agent.pid'
INFO 2015-05-14 12:17:16,598 ProcessHelper.py:46 - Removing temp files

*1.2) kill 'child main.py' process*
- it also killed both the 'bash main.py' process and its child 'main.py' process
- but the application process (nimbus) is still running

SliderAgent.log :
INFO 2015-05-14 12:25:37,990 main.py:56 - signal received, exiting.
INFO 2015-05-14 12:25:37,990 ProcessHelper.py:39 - Removing pid file
INFO 2015-05-14 12:25:37,990 ProcessHelper.py:46 - Removing temp files

*2) Slider app created, and it is stopped.*

*2.1) kill 'bash main.py' process*
- it killed only the 'bash main.py' process and not its child 'main.py' process
- the application process (nimbus) is still running
- *no logs were written to SliderAgent.log*
- the container logs had been completely cleared by the time this was done

*2.2) kill 'child main.py' process*
- it killed both the 'bash main.py' process and its child 'main.py' process
- the application process (nimbus) is still running
- the container logs had been completely cleared by the time this was done

SliderAgent.log :
INFO 2015-05-14 12:48:25,589 main.py:56 - signal received, exiting.
INFO 2015-05-14 12:48:25,589 ProcessHelper.py:39 - Removing pid file
INFO 2015-05-14 12:48:25,590 ProcessHelper.py:46 - Removing temp files

On Thu, May 14, 2015 at 1:38 AM, Gour Saha <[email protected]> wrote:

> Hi Chackra,
>
> You are absolutely right. The workaround that I was planning to work on
> should be implemented as a neat backup solution for when YARN fails to
> shut down containers (in this and certain other possible scenarios).
>
> In fact, we had filed a bug a long time back along the same lines,
> predicting this issue (for another scenario) -
> https://issues.apache.org/jira/browse/SLIDER-479
>
> As you had expressed interest in contributing to Slider, I was wondering
> if you would have some cycles and be willing to take this up. You can work
> on the develop branch and use SLIDER-479. Slider develop branch is
> compatible with HDP 2.2, so we can easily test the fix in your cluster.
>
> Let me know, and I can help all along the way.
>
> In case you have some cycles, here are some pointers that might help you
> to approach this problem -
>
> 1. Slider has a notion of sending a terminate command to the agent, which
> the agent obeys and gracefully brings itself down
> 2. In this scenario, since the Slider AM goes down, the agents can look
> for a node in Zookeeper (when they lose connection with the AM) and shut
> themselves down if the node is missing (using the terminate code path or
> something more elegant)
> 3. Of course this Zookeeper node needs to be created by the Slider AM at
> the beginning of create cluster and then deleted just before the AM shuts
> down as part of the stop command (might have to look into the YARN
> pre-emption scenario, but we can ignore this for now). We do not want to
> delete this in the AM failure/restart scenario.
> 4. Any other better ideas or elegant solutions you can think of
>
> On a side note, can you test this in debian 7 -
> Go to one of the nodes where any of the agents are running (say NIMBUS or
> any other component) and then issue a SIGTERM to the main.py process
> (kill -s TERM <pid>). What do you see in the slider-agent.log after that?
> What happens to all the processes in this container? Are they still
> running?
>
> The <pid> is that of the bash main.py process (not the python main.py
> child process).
>
> So if the process is something like this -
> yarn 6007 6003 0 19:43 ? 00:00:00 /bin/bash -c python ./infra/agent/slider-agent/agent/main.py --label container_1431413628146_0003_01_000002___NIMBUS --zk-quorum c6408.ambari.apache.org:2181 --zk-reg-path /registry/users/yarn/services/org-apache-slider/storm_1 > /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out 2>&1
>
> You need to issue -
> kill -s TERM 6007
>
> -Gour
>
> On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <[email protected]
> <mailto:[email protected]>> wrote:
>
> Thanks for your response steve,
>
> I was thinking that SliderAgent would receive a 'stop' command from
> SliderAM to kill the components spawned by those agents. And yeah, this
> might be specific to the debian installation, as others in the group are
> not facing this issue.
>
> On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <[email protected]
> <mailto:[email protected]>> wrote:
>
> > On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu
> > <[email protected]<mailto:[email protected]>> wrote:
> >
> > Starting a new thread,
> >
> > a JIRA has already been filed for the same by Gour,
> > https://issues.apache.org/jira/browse/YARN-3561
> >
> > Slider stop does not stop the components started by slider; instead it
> > stops only SliderAM, and even the SliderAgents did not receive a 'stop'
> > command. (It happens with debian 7, and was tested with 0.70.1 as well
> > as the 'develop' branch code.)
> >
> > Today I just came across the following mail archive,
> >
> > http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%[email protected]%3E
> >
> > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> >
> > *What is not implemented is an explicit call to "stop function in the
> > python scripts".
> >
> > What I was referring to that an attempt is made by the Agent to call
> > stop in the python script
> > but it is not guaranteed. The reason it is not guaranteed is that the
> > call to stop() and kill
> > of the containers by YARN is not co-ordinated.
> >
> > In summary, the ability to call stop() functions in the python script
> > is not implemented.
> > Its in the plan though.*
> >
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >
> > Does this still exist?
>
> The idea of the stop() command is to actually offer a best-effort clean
> shutdown for containers. Currently the AM just directly tells YARN to
> destroy a container. The agent doesn't get told, nor does the application
> (that's implicit from the agent).
>
> YARN is expected to "kill" then, if there is no response, "kill -9" the
> agent process. Which it does for the hosts we test on: linux, OSX and
> windows.
>
> If something is up with your YARN+debian installation, we believe that it
> is related to whether those container kill events are coming out from the
> node manager.
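
For discussion, here is a minimal Python sketch of the Zookeeper marker-node idea from Gour's pointers 2 and 3 above. It is only an illustration under stated assumptions, not existing Slider code: FakeRegistry is an in-memory stand-in for Zookeeper, and MARKER, am_create_cluster, am_stop, am_crash and agent_should_terminate are all hypothetical names. A real change would go through the agent's actual Zookeeper client and the existing terminate code path.

```python
class FakeRegistry:
    """In-memory stand-in for Zookeeper; it only tracks which paths exist."""

    def __init__(self):
        self._paths = set()

    def create(self, path):
        self._paths.add(path)

    def delete(self, path):
        self._paths.discard(path)

    def exists(self, path):
        return path in self._paths


# Hypothetical marker path (assumption); the real location would be
# decided as part of SLIDER-479.
MARKER = "/registry/users/yarn/services/org-apache-slider/storm_1/running"


def am_create_cluster(registry):
    # The AM creates the marker at the beginning of create cluster.
    registry.create(MARKER)


def am_stop(registry):
    # The AM deletes the marker just before shutting down on 'stop'.
    registry.delete(MARKER)


def am_crash(registry):
    # AM failure/restart: the marker is deliberately left in place.
    pass


def agent_should_terminate(registry, am_reachable):
    # The agent consults the marker only after losing the AM connection.
    if am_reachable:
        return False
    # Marker missing means the AM was stopped on purpose: shut down.
    return not registry.exists(MARKER)


if __name__ == "__main__":
    registry = FakeRegistry()
    am_create_cluster(registry)

    am_crash(registry)
    print(agent_should_terminate(registry, am_reachable=False))  # False

    am_stop(registry)
    print(agent_should_terminate(registry, am_reachable=False))  # True
```

The point of keeping the marker across an AM crash is that agents which merely lose the connection keep their components alive until the AM restarts, while an explicit stop makes them shut down even if YARN never delivers the kill.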
