Sure Gour, I have already filed an issue <https://issues.apache.org/jira/browse/HADOOP-11989> for this in hadoop-common and will submit a patch. Thanks for your help in debugging this issue.
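
For anyone following along, here is a minimal Python sketch of the command change described in the quoted message below. It is only an illustration, not the Shell.java patch itself; the function name build_group_kill_command and its parameters are made up for this example. My understanding is that "--" marks the end of options, so the negative number after it is read as a process group ID rather than as another option.

```python
from typing import List


def build_group_kill_command(signal_no: int, pgid: int,
                             end_of_options: bool = True) -> List[str]:
    """Build the argv for signalling a whole process group with kill(1).

    Hypothetical helper for illustration; it is not the Hadoop code.
    """
    if signal_no <= 0 or pgid <= 0:
        raise ValueError("signal number and process group ID must be positive")
    argv = ["kill", "-%d" % signal_no]
    if end_of_options:
        # "--" ends option parsing, so "-<pgid>" cannot be taken for an option.
        argv.append("--")
    # A negative PID argument asks kill to signal the whole process group.
    argv.append("-%d" % pgid)
    return argv


# 6007 is just the example PID that appears later in this thread.
print(" ".join(build_group_kill_command(15, 6007, end_of_options=False)))
# prints: kill -15 -6007
print(" ".join(build_group_kill_command(15, 6007)))
# prints: kill -15 -- -6007
```

From Python itself, os.killpg(pgid, signal.SIGTERM) signals a process group directly and avoids the command-line parsing question altogether.
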
On Mon, May 18, 2015 at 2:54 AM, Gour Saha <[email protected]> wrote:

> Chackra,
> This is wonderful. Thanks for debugging this and finding a solution.
>
> Now that we know it works for Debian, we have to check if this modified
> syntax will work for other OSes. If yes, then we can make a
> machine-independent change. Otherwise we have to make an OS-specific code
> change.
>
> I would suggest you submit a patch for this change on the hadoop trunk
> branch. You should get credit for it.
>
> -Gour
>
> - Sent from my iPhone
>
> > On May 17, 2015, at 11:05 AM, "Chackravarthy Esakkimuthu"
> > <[email protected]> wrote:
> >
> > Gour/Steve,
> >
> > The issue was caused by improper kill command construction by
> > DefaultContainerExecutor, and hence kill SIGTERM itself was not issued to
> > SliderAgent, so all the agents as well as components continue to run.
> >
> > I made one change in Shell.java (hadoop-common) to construct the kill
> > command including two hyphens, and now slider stop works properly :)
> >
> > It was, *kill -signalNo -<process_id>*
> > changed to, *kill -signalNo -- -<process_id>*
> >
> > I have updated the same in JIRA as well,
> >
> > https://issues.apache.org/jira/browse/YARN-3561
> >
> >
> > Thanks,
> > Chackra
> >
> >
> > On Thu, May 14, 2015 at 1:03 PM, Chackravarthy Esakkimuthu
> > <[email protected]> wrote:
> >
> >> Sure Gour, I would like to take up this task and contribute. Thanks for
> >> the pointers to proceed with; I will get in touch with you in case I
> >> need any more help.
> >>
> >> And wrt kill -s TERM on main.py processes (tried on both parent and
> >> child process independently), please find the result as follows:
> >>
> >> In none of the cases was the application killed.
> >>
> >> *1) Slider app created, and it is running (not stopped)*
> >>
> >> *1.1) kill 'bash main.py' process*
> >>
> >> - it killed both 'bash main.py' and its 'child main.py' process
> >> - but the application process (nimbus) is still running
> >>
> >>
> >> SliderAgent.log :
> >>
> >> *INFO 2015-05-14 12:17:06,591 Controller.py:497 - Component states (result): Expected: 4 and Actual: 5*
> >> *ERROR 2015-05-14 12:17:06,596 Controller.py:289 - Got terminateAgent command*
> >> *INFO 2015-05-14 12:17:16,597 Controller.py:217 - Terminate agent command received from AM, stopping the agent ...*
> >> *INFO 2015-05-14 12:17:16,597 ProcessHelper.py:39 - Removing pid file*
> >> *WARNING 2015-05-14 12:17:16,598 ProcessHelper.py:44 - Unable to remove pid file: [Errno 2] No such file or directory: '/grid/4/hadoop/yarn/local/usercache/yarn/appcache/application_1431424102217_0003/container_1431424102217_0003_01_000002/infra/run/agent.pid'*
> >> *INFO 2015-05-14 12:17:16,598 ProcessHelper.py:46 - Removing temp files*
> >>
> >> *1.2) kill 'child main.py' process*
> >>
> >> - it also killed both 'bash main.py' and its 'child main.py' process
> >> - but the application process (nimbus) is still running
> >>
> >>
> >> SliderAgent.log :
> >>
> >> *INFO 2015-05-14 12:25:37,990 main.py:56 - signal received, exiting.*
> >> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:39 - Removing pid file*
> >> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:46 - Removing temp files*
> >>
> >>
> >> *2) Slider app created, and it is stopped.*
> >>
> >> *2.1) kill 'bash main.py' process*
> >>
> >> - it killed only 'bash main.py' and not the 'child main.py' process
> >> - and the application process (nimbus) is still running
> >> - *no logs appeared in SliderAgent*
> >> - and container logs are completely cleared by the time this action is
> >>   done
> >>
> >> *2.2) kill 'child main.py' process*
> >>
> >> - it killed both 'bash main.py' and its 'child main.py' process
> >> - and the application process (nimbus) is still running
> >> - and container logs are completely cleared by the time this action is
> >>   done
> >>
> >>
> >> SliderAgent.log :
> >>
> >> *INFO 2015-05-14 12:48:25,589 main.py:56 - signal received, exiting.*
> >> *INFO 2015-05-14 12:48:25,589 ProcessHelper.py:39 - Removing pid file*
> >> *INFO 2015-05-14 12:48:25,590 ProcessHelper.py:46 - Removing temp files*
> >>
> >>
> >>> On Thu, May 14, 2015 at 1:38 AM, Gour Saha <[email protected]> wrote:
> >>>
> >>> Hi Chackra,
> >>>
> >>> You are absolutely right. The workaround that I was planning to work
> >>> on should be implemented as a neat backup solution, for when YARN
> >>> fails to shut down containers (in this and certain other possible
> >>> scenarios).
> >>>
> >>> In fact, we had filed a bug a long time back along the same lines,
> >>> predicting this issue (for another scenario) -
> >>> https://issues.apache.org/jira/browse/SLIDER-479
> >>>
> >>> As you had expressed interest in contributing to Slider, I was
> >>> wondering if you would have some cycles and be willing to take this
> >>> up. You can work on the develop branch and use SLIDER-479. The Slider
> >>> develop branch is compatible with HDP 2.2, so we can easily test the
> >>> fix in your cluster.
> >>>
> >>> Let me know, and I can help all along the way.
> >>>
> >>> In case you have some cycles, here are some pointers that might help
> >>> you to approach this problem -
> >>>
> >>> 1. Slider has a notion of sending a terminate command to the agent,
> >>> which the agent obeys and gracefully brings itself down
> >>> 2. In this scenario, since the Slider AM goes down, the agents can
> >>> look for a node in Zookeeper (when they lose connection with the AM)
> >>> and shut themselves down if the node is missing (using the terminate
> >>> code path or something more elegant)
> >>> 3. Of course this Zookeeper node needs to be created by the Slider AM
> >>> at the beginning of create cluster and then deleted just before the AM
> >>> shuts down as part of the stop command (might have to look into the
> >>> YARN pre-emption scenario, but we can ignore this for now). We do not
> >>> want to delete this in the AM failure/restart scenario.
> >>> 4. Any other better ideas or elegant solution you can think of
> >>>
> >>> On a side note, can you test this in debian 7 -
> >>> Go to one of the nodes where any of the agents are running (say NIMBUS
> >>> or any other component) and then issue a SIGTERM to the main.py
> >>> process (kill -s TERM <pid>). What do you see in the slider-agent.log
> >>> after that? What happens to all the processes in this container? Are
> >>> they still running?
> >>>
> >>> The <pid> is that of the bash main.py process (not the python main.py
> >>> child process).
> >>>
> >>> So if the process is something like this -
> >>> yarn 6007 6003 0 19:43 ? 00:00:00 /bin/bash -c python ./infra/agent/slider-agent/agent/main.py --label container_1431413628146_0003_01_000002___NIMBUS --zk-quorum c6408.ambari.apache.org:2181 --zk-reg-path /registry/users/yarn/services/org-apache-slider/storm_1 > /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out 2>&1
> >>>
> >>> You need to issue -
> >>> kill -s TERM 6007
> >>>
> >>> -Gour
> >>>
> >>> On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <[email protected]>
> >>> wrote:
> >>>
> >>> Thanks for your response Steve,
> >>>
> >>> I was thinking that SliderAgent would receive a 'stop' command from
> >>> SliderAM to kill the components spawned by those agents. And yeah,
> >>> this might be specific to the debian installation, as others in the
> >>> group are not facing this issue.
> >>>
> >>> On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <[email protected]>
> >>> wrote:
> >>>
> >>>
> >>>>> On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu <[email protected]> wrote:
> >>>>
> >>>> Starting a new thread,
> >>>>
> >>>> a JIRA has already been filed for the same by Gour,
> >>>> https://issues.apache.org/jira/browse/YARN-3561
> >>>>
> >>>> Slider stop does not stop the components started by slider; instead it
> >>>> stops only SliderAM, and even SliderAgents did not receive the 'stop'
> >>>> command. (It happens with debian 7.) Tested with 0.70.1 as well as
> >>>> 'develop' branch code.
> >>>>
> >>>> Today I just came across the following mail archive,
> >>>>
> >>>> http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%[email protected]%3E
> >>>>
> >>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> >>>>
> >>>> *What is not implemented is an explicit call to "stop function in the
> >>>> python scripts".
> >>>>
> >>>> What I was referring to that an attempt is made by the Agent to call
> >>>> stop in the python script
> >>>> but it is not guaranteed. The reason it is not guaranteed is that the
> >>>> call to stop() and kill
> >>>> of the containers by YARN is not co-ordinated.
> >>>>
> >>>> In summary, the ability to call stop() functions in the python script
> >>>> is not implemented.
> >>>> Its in the plan though.*
> >>>>
> >>>>
> >>>> Does this still exist?
> >>>
> >>>
> >>> The idea of the stop() command is to actually offer a best-effort
> >>> clean shutdown for containers. Currently the AM just directly tells
> >>> YARN to destroy a container. The agent doesn't get told, nor does the
> >>> application (that's implicit from the agent).
> >>>
> >>> YARN is expected to "kill" then, if there is no response, "kill -9"
> >>> the agent process. Which it does for the hosts we test on: linux, OSX
> >>> and windows.
> >>>
> >>> If something is up with your YARN+debian installation, we believe that
> >>> it is related to whether those container kill events are coming out
> >>> from the node manager.
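
The "kill", then "kill -9" sequence Steve describes above can be sketched in Python roughly as follows. This is only an illustration under my own assumptions, not how the YARN node manager implements it; stop_process and grace_seconds are hypothetical names, and the sketch signals a single process rather than a process group.

```python
import signal
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask a child process to exit, then force it if it does not respond.

    Hypothetical helper for illustration. Returns the child's exit code;
    on POSIX a negative value -N means the child was killed by signal N.
    """
    proc.terminate()  # sends SIGTERM on POSIX, i.e. a plain "kill"
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX, i.e. "kill -9"
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen(["sleep", "60"])
    code = stop_process(child)
    print("child stopped by SIGTERM:", code == -signal.SIGTERM)
```

If the child ignores SIGTERM, the wait times out and the function falls through to SIGKILL, which cannot be caught or ignored.
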
