Chackra, This is wonderful. Thanks for debugging this and finding a solution.
Now that we know it works for Debian, we have to check whether this modified syntax works on other OSes. If it does, we can make a machine-independent change. Otherwise we have to make an OS-specific code change. I would suggest you submit a patch for this change on the hadoop trunk branch. You should get credit for it.

-Gour

- Sent from my iPhone

> On May 17, 2015, at 11:05 AM, "Chackravarthy Esakkimuthu" <[email protected]> wrote:
>
> Gour/Steve,
>
> The issue was caused by improper kill command construction in
> DefaultContainerExecutor. As a result, kill SIGTERM itself was never issued
> to SliderAgent, so all the agents as well as the components continued to run.
>
> I made one change in Shell.java (hadoop-common) to construct the kill
> command with two hyphens, and now slider stop works properly :)
>
> It was: *kill -signalNo -<process_id>*
> Changed to: *kill -signalNo -- -<process_id>*
>
> I have updated the same in JIRA as well:
>
> https://issues.apache.org/jira/browse/YARN-3561
>
>
> Thanks,
> Chackra
>
>
> On Thu, May 14, 2015 at 1:03 PM, Chackravarthy Esakkimuthu <[email protected]> wrote:
>
>> Sure Gour, I would like to take up this task and contribute. Thanks for the
>> pointers to proceed with. I will get in touch with you in case I
>> need any more help.
>>
>> And wrt kill -s TERM on main.py processes (tried on both parent and child
>> process independently), please find the results as follows:
>>
>> In none of the cases was the application killed.
>>
>> *1) Slider app created, and it is running (not stopped)*
>>
>> *1.1) kill 'bash main.py' process*
>>
>> - it killed both 'bash main.py' and its 'child main.py' process
>> - but the application process (nimbus) is still running
>>
>>
>> SliderAgent.log :
>>
>> *INFO 2015-05-14 12:17:06,591 Controller.py:497 - Component states (result): Expected: 4 and Actual: 5*
>> *ERROR 2015-05-14 12:17:06,596 Controller.py:289 - Got terminateAgent command*
>> *INFO 2015-05-14 12:17:16,597 Controller.py:217 - Terminate agent command received from AM, stopping the agent ...*
>> *INFO 2015-05-14 12:17:16,597 ProcessHelper.py:39 - Removing pid file*
>> *WARNING 2015-05-14 12:17:16,598 ProcessHelper.py:44 - Unable to remove pid file: [Errno 2] No such file or directory: '/grid/4/hadoop/yarn/local/usercache/yarn/appcache/application_1431424102217_0003/container_1431424102217_0003_01_000002/infra/run/agent.pid'*
>> *INFO 2015-05-14 12:17:16,598 ProcessHelper.py:46 - Removing temp files*
>>
>> *1.2) kill 'child main.py' process*
>>
>> - it also killed both 'bash main.py' and its 'child main.py' process
>> - but the application process (nimbus) is still running
>>
>>
>> SliderAgent.log :
>>
>> *INFO 2015-05-14 12:25:37,990 main.py:56 - signal received, exiting.*
>> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:39 - Removing pid file*
>> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:46 - Removing temp files*
>>
>>
>> *2) Slider app created, and it is stopped.*
>>
>> *2.1) kill 'bash main.py' process*
>>
>> - it killed only 'bash main.py' and not 'child main.py' process
>> - and the application process (nimbus) is still running
>> - *no logs appeared in SliderAgent*
>> - and container logs are completely cleared by the time this action is
>> done
>>
>> *2.2) kill 'child main.py' process*
>>
>> - it killed both 'bash main.py' and its 'child main.py' process
>> - and the application process (nimbus) is still running
>> - and container logs are completely cleared by the time this action is
>> done
>>
>>
>> SliderAgent.log :
>>
>> *INFO 2015-05-14 12:48:25,589 main.py:56 - signal received, exiting.*
>> *INFO 2015-05-14 12:48:25,589 ProcessHelper.py:39 - Removing pid file*
>> *INFO 2015-05-14 12:48:25,590 ProcessHelper.py:46 - Removing temp files*
>>
>>
>>> On Thu, May 14, 2015 at 1:38 AM, Gour Saha <[email protected]> wrote:
>>>
>>> Hi Chackra,
>>>
>>> You are absolutely right. The workaround that I was planning to work on
>>> should be implemented as a neat backup solution for when YARN fails to
>>> shut down containers (in this and certain other possible scenarios).
>>>
>>> In fact, we had filed a bug a long time back along the same lines,
>>> predicting this issue (for another scenario) -
>>> https://issues.apache.org/jira/browse/SLIDER-479
>>>
>>> As you had expressed interest in contributing to Slider, I was wondering if
>>> you would have some cycles and be willing to take this up. You can work on
>>> the develop branch and use SLIDER-479. Slider develop branch is compatible
>>> with HDP 2.2, so we can easily test the fix in your cluster.
>>>
>>> Let me know, and I can help all along the way.
>>>
>>> In case you have some cycles, here are some pointers that might help you
>>> to approach this problem -
>>>
>>> 1. Slider has a notion of sending a terminate command to the agent, which
>>> the agent obeys and gracefully brings itself down
>>> 2. In this scenario, since Slider AM goes down, the agents can look for a
>>> node in Zookeeper (when they lose connection with the AM) and shut themselves
>>> down if the node is missing (using the terminate code path or something
>>> more elegant)
>>> 3. Of course this Zookeeper node needs to be created by Slider AM at the
>>> beginning of create cluster and then deleted just before the AM shuts down
>>> as part of the stop command (might have to look into YARN pre-emption
>>> scenario, but we can ignore this for now). We do not want to delete this in
>>> AM failure/restart scenario.
>>> 4. Any other better ideas or elegant solution you can think of
>>>
>>> On a side note, can you test this in debian 7 -
>>> Go to one of the nodes where any of the agents are running (say NIMBUS or
>>> any other component) and then issue a SIGTERM to the main.py process (kill
>>> -s TERM <pid>). What do you see in the slider-agent.log after that? What
>>> happens to all the processes in this container? Are they still running?
>>>
>>> The <pid> is that of the bash main.py process (not the python main.py
>>> child process).
>>>
>>> So if the process is something like this -
>>> yarn 6007 6003 0 19:43 ? 00:00:00 /bin/bash -c python ./infra/agent/slider-agent/agent/main.py --label container_1431413628146_0003_01_000002___NIMBUS --zk-quorum c6408.ambari.apache.org:2181 --zk-reg-path /registry/users/yarn/services/org-apache-slider/storm_1 > /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out 2>&1
>>>
>>> You need to issue -
>>> kill -s TERM 6007
>>>
>>> -Gour
>>>
>>> On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <[email protected] <mailto:[email protected]>> wrote:
>>>
>>> Thanks for your response Steve,
>>>
>>> I was thinking that SliderAgent would receive a 'stop' command from SliderAM
>>> to kill the components spawned by those agents. And yeah, this might be
>>> specific to the debian installation, as others in the group are not facing this
>>> issue.
>>>
>>> On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <[email protected] <mailto:[email protected]>> wrote:
>>>
>>>
>>>>> On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu <[email protected]<mailto:[email protected]>> wrote:
>>>>
>>>> Starting a new thread,
>>>>
>>>> already JIRA filed for the same by Gour,
>>>> https://issues.apache.org/jira/browse/YARN-3561
>>>>
>>>> Slider stop does not stop the components started by slider. Instead it
>>>> stops only SliderAM, and even SliderAgents did not receive the 'stop' command.
>>>> (It happens with debian 7; tested with 0.70.1 as well as 'develop'
>>>> branch code.)
>>>>
>>>> Today I just came across the following mail archive,
>>>
>>> http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%[email protected]%3E
>>>>
>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>
>>>> *What is not implemented is an explicit call to "stop function in the
>>>> python scripts".
>>>>
>>>> What I was referring to that an attempt is made by the Agent to call
>>>> stop in the python script
>>>> but it is not guaranteed. The reason it is not guaranteed is that the
>>>> call to stop() and kill
>>>> of the containers by YARN is not co-ordinated.
>>>>
>>>> In summary, the ability to call stop() functions in the python script
>>>> is not implemented.
>>>> Its in the plan though.*
>>>>
>>>>
>>>> Does this still exist?
>>>
>>>
>>> The idea of the stop() command is to actually offer a best-effort clean
>>> shutdown for containers. Currently the AM just directly tells YARN to
>>> destroy a container. The agent doesn't get told, nor does the application
>>> (that's implicit from the agent).
>>>
>>> YARN is expected to "kill" then, if there is no response, "kill -9" the
>>> agent process. Which it does for the hosts we test on, linux, OSX and
>>> windows.
>>>
>>> If something is up with your YARN+debian installation, we believe that it
>>> is related to whether those container kill events are coming out from the
>>> node manager.
>>
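For anyone following along, the kill command change described in Chackra's May 17 message can be sketched roughly as below. This is only an illustration in Python, not the actual Shell.java code. The function names are hypothetical, and the process-group id 6007 is borrowed from the ps example in the thread.

```python
import signal


def build_group_kill_command(signal_no, pgid):
    """Return an argv list for signalling a whole process group via kill(1).

    The "--" marks the end of options, so the negative process-group id
    that follows is read as a target and not as another option.
    (Hypothetical helper, mirroring the fix described in the thread.)
    """
    if pgid <= 0:
        raise ValueError("pgid must be a positive process-group id")
    return ["kill", "-%d" % signal_no, "--", "-%d" % pgid]


def build_group_kill_command_old(signal_no, pgid):
    """The earlier form from the thread, without "--" (for comparison only)."""
    return ["kill", "-%d" % signal_no, "-%d" % pgid]


if __name__ == "__main__":
    # Show both forms side by side for SIGTERM and the example group id.
    print(" ".join(build_group_kill_command_old(signal.SIGTERM, 6007)))
    print(" ".join(build_group_kill_command(signal.SIGTERM, 6007)))
```

Whether the "--" form is accepted by the kill implementation on every OS is exactly the open question Gour raises above, so it would need checking per platform.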

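The "kill, then kill -9 if there is no response" behaviour Steve describes can be sketched in a similar way. This is a minimal Python illustration on a single child process, assuming a POSIX host. It is not how the node manager implements it: the function name and the grace period are made up for the example, and the real executor signals the container's process group.

```python
import subprocess
import sys


def stop_with_escalation(proc, grace_seconds=10.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the child's exit status as reported by Popen.wait().
    """
    proc.terminate()  # SIGTERM on POSIX: the polite "kill"
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: the "kill -9" step
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for an agent process: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print("exit status:", stop_with_escalation(child, grace_seconds=5.0))
```

The point of the grace period is to give the agent's signal handler (the "signal received, exiting." path in the logs above) a chance to run before the process is forcibly removed.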