Chackra,
This is wonderful. Thanks for debugging this and finding a solution. 

Now that we know it works for Debian, we have to check whether this modified
syntax also works on other OSes. If it does, we can make a machine-independent
change. Otherwise we will have to make an OS-specific code change.
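
To make the comparison concrete when testing on other OSes, here is a minimal
Python sketch of the two command forms. It is not the Shell.java code: the
function name build_group_kill_command and its parameters are made up for
illustration, and it only builds the argument list without running kill.

```python
def build_group_kill_command(signal_no, pgid, use_double_dash=True):
    """Build the argv for signalling a whole process group via kill(1).

    Hypothetical helper for illustration only; not part of Hadoop or Slider.
    A negative id tells kill to target the process group. The optional "--"
    marks the end of options so "-<pgid>" is not parsed as another option.
    """
    if pgid <= 0:
        raise ValueError("pgid must be a positive process group id")
    argv = ["kill", "-" + str(signal_no)]
    if use_double_dash:
        argv.append("--")
    argv.append("-" + str(pgid))
    return argv


# Original form: kill -signalNo -<process_id>
print(" ".join(build_group_kill_command(15, 6007, use_double_dash=False)))
# Modified form: kill -signalNo -- -<process_id>
print(" ".join(build_group_kill_command(15, 6007)))
```

Running both printed forms by hand against a throwaway process group on each
OS should tell us whether the "--" variant is accepted everywhere.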

I would suggest you submit a patch for this change against the Hadoop trunk
branch. You should get credit for it.

-Gour


> On May 17, 2015, at 11:05 AM, "Chackravarthy Esakkimuthu" 
> <[email protected]> wrote:
> 
> Gour/Steve,
> 
> The issue was caused by improper kill command construction in
> DefaultContainerExecutor. As a result, SIGTERM was never actually delivered
> to SliderAgent, so all the agents as well as the components continued to run.
> 
> I made one change in Shell.java (hadoop-common) so that the kill command is
> constructed with two hyphens before the negative process id, and slider stop
> now works properly :)
> 
> It was:        *kill -signalNo -<process_id>*
> Changed to:    *kill -signalNo -- -<process_id>*
> 
> I have updated the JIRA as well:
> 
> https://issues.apache.org/jira/browse/YARN-3561
> 
> 
> Thanks,
> Chackra
> 
> 
> On Thu, May 14, 2015 at 1:03 PM, Chackravarthy Esakkimuthu <
> [email protected]> wrote:
> 
>> Sure Gour, I would like to take up this task and contribute. Thanks for the
>> pointers to get started. I will get in touch with you in case I need any
>> more help.
>> 
>> Regarding kill -s TERM on the main.py processes (tried on the parent and
>> the child process independently), the results are as follows.
>> 
>> In none of the cases was the application killed.
>> 
>> *1) Slider app created, and it's running (not stopped)*
>> 
>> *1.1) kill 'bash main.py' process*
>> 
>>   - it killed both the 'bash main.py' process and its 'child main.py' process
>>   - but the application process (nimbus) is still running
>> 
>> 
>> SliderAgent.log :
>> 
>> *INFO 2015-05-14 12:17:06,591 Controller.py:497 - Component states
>> (result): Expected: 4 and Actual: 5*
>> *ERROR 2015-05-14 12:17:06,596 Controller.py:289 - Got terminateAgent
>> command*
>> *INFO 2015-05-14 12:17:16,597 Controller.py:217 - Terminate agent command
>> received from AM, stopping the agent ...*
>> *INFO 2015-05-14 12:17:16,597 ProcessHelper.py:39 - Removing pid file*
>> *WARNING 2015-05-14 12:17:16,598 ProcessHelper.py:44 - Unable to remove
>> pid file: [Errno 2] No such file or directory:
>> '/grid/4/hadoop/yarn/local/usercache/yarn/appcache/application_1431424102217_0003/container_1431424102217_0003_01_000002/infra/run/agent.pid'*
>> *INFO 2015-05-14 12:17:16,598 ProcessHelper.py:46 - Removing temp files*
>> 
>> *1.2) kill 'child main.py' process*
>> 
>>   - it also killed both the 'bash main.py' process and its 'child main.py'
>>   process
>>   - but the application process (nimbus) is still running
>> 
>> 
>> SliderAgent.log :
>> 
>> *INFO 2015-05-14 12:25:37,990 main.py:56 - signal received, exiting.*
>> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:39 - Removing pid file*
>> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:46 - Removing temp files*
>> 
>> 
>> *2) Slider app created, and it's stopped.*
>> 
>> *2.1) kill 'bash main.py' process*
>> 
>>   - it killed only the 'bash main.py' process and not the 'child main.py'
>>   process
>>   - the application process (nimbus) is still running
>>   - *no logs appeared in SliderAgent*
>>   - the container logs were completely cleared by the time this action was
>>   done
>> 
>> *2.2) kill 'child main.py' process*
>> 
>>   - it killed both the 'bash main.py' process and its 'child main.py' process
>>   - the application process (nimbus) is still running
>>   - the container logs were completely cleared by the time this action was
>>   done
>> 
>> 
>> SliderAgent.log :
>> 
>> *INFO 2015-05-14 12:48:25,589 main.py:56 - signal received, exiting.*
>> *INFO 2015-05-14 12:48:25,589 ProcessHelper.py:39 - Removing pid file*
>> *INFO 2015-05-14 12:48:25,590 ProcessHelper.py:46 - Removing temp files*
>> 
>> 
>>> On Thu, May 14, 2015 at 1:38 AM, Gour Saha <[email protected]> wrote:
>>> 
>>> Hi Chackra,
>>> 
>>> You are absolutely right. The workaround that I was planning to work on
>>> should be implemented as a neat backup solution for when YARN fails to
>>> shut down containers (in this and certain other possible scenarios).
>>> 
>>> In fact, we had filed a bug a long time back along the same lines,
>>> predicting this issue (for another scenario):
>>> https://issues.apache.org/jira/browse/SLIDER-479
>>> 
>>> As you had expressed interest in contributing to Slider, I was wondering
>>> whether you have some cycles and would be willing to take this up. You can
>>> work on the develop branch and use SLIDER-479. The Slider develop branch is
>>> compatible with HDP 2.2, so we can easily test the fix in your cluster.
>>> 
>>> Let me know, and I can help all along the way.
>>> 
>>> In case you have some cycles, here are some pointers that might help you
>>> approach this problem:
>>> 
>>> 1. Slider has a notion of sending a terminate command to the agent, which
>>> the agent obeys by gracefully bringing itself down.
>>> 2. In this scenario, since the Slider AM goes down, the agents can look for
>>> a node in Zookeeper (when they lose connection with the AM) and shut
>>> themselves down if the node is missing (using the terminate code path or
>>> something more elegant).
>>> 3. Of course, this Zookeeper node needs to be created by the Slider AM at
>>> the beginning of cluster creation and then deleted just before the AM shuts
>>> down as part of the stop command (we might have to look into the YARN
>>> pre-emption scenario, but we can ignore this for now). We do not want to
>>> delete it in the AM failure/restart scenario.
>>> 4. Any other better ideas or elegant solutions you can think of.
>>> 
>>> On a side note, can you test this on Debian 7:
>>> Go to one of the nodes where any of the agents is running (say NIMBUS or
>>> any other component) and then issue a SIGTERM to the main.py process (kill
>>> -s TERM <pid>). What do you see in the slider-agent.log after that? What
>>> happens to all the processes in this container? Are they still running?
>>> 
>>> The <pid> is that of the bash main.py process (not the python main.py
>>> child process).
>>> 
>>> So if the process is something like this -
>>> yarn      6007  6003  0 19:43 ?        00:00:00 /bin/bash -c python
>>> ./infra/agent/slider-agent/agent/main.py --label
>>> container_1431413628146_0003_01_000002___NIMBUS --zk-quorum
>>> c6408.ambari.apache.org:2181 --zk-reg-path
>>> /registry/users/yarn/services/org-apache-slider/storm_1 >
>>> /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out
>>> 2>&1
>>> 
>>> You need to issue -
>>> kill -s TERM 6007
>>> 
>>> -Gour
>>> 
>>> On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <[email protected]
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Thanks for your response, Steve.
>>> 
>>> I was thinking that SliderAgent would receive a 'stop' command from the
>>> SliderAM to kill the components spawned by those agents. And yes, this
>>> might be specific to the Debian installation, as others in the group are
>>> not facing this issue.
>>> 
>>> On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <[email protected]
>>> <mailto:[email protected]>>
>>> wrote:
>>> 
>>> 
>>>>> On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu <
>>>> [email protected]<mailto:[email protected]>> wrote:
>>>> 
>>>> Starting a new thread.
>>>> 
>>>> Gour has already filed a JIRA for this:
>>>> https://issues.apache.org/jira/browse/YARN-3561
>>>> 
>>>> Slider stop does not stop the components started by Slider. Instead it
>>>> stops only the SliderAM, and the SliderAgents do not even receive a
>>>> 'stop' command. This happens on Debian 7, and I have tested it with
>>>> 0.70.1 as well as with the 'develop' branch code.
>>>> 
>>>> Today I just came across the following mail archive,
>>> 
>>> http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%[email protected]%3E
>>>> 
>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>> 
>>>> *What is not implemented is an explicit call to "stop function in the
>>>> python scripts".
>>>> 
>>>> What I was referring to that an attempt is made by the Agent to call
>>>> stop in the python script
>>>> but it is not guaranteed. The reason it is not guaranteed is that the
>>>> call to stop() and kill
>>>> of the containers by YARN is not co-ordinated.
>>>> 
>>>> In summary, the ability to call stop() functions in the python script
>>>> is not implemented.
>>>> Its in the plan though.*
>>>> 
>>>> 
>>>> Does this still exist?
>>> 
>>> 
>>> The idea of the stop() command is to offer a best-effort clean shutdown
>>> for containers. Currently the AM just directly tells YARN to destroy a
>>> container. The agent doesn't get told, nor does the application
>>> (that's implicit from the agent).
>>> 
>>> YARN is expected to "kill" the agent process and then, if there is no
>>> response, "kill -9" it. It does so on the hosts we test on: Linux, OSX and
>>> Windows.
>>> 
>>> If something is up with your YARN+Debian installation, we believe that it
>>> is related to whether those container kill events are coming out of the
>>> node manager.
>> 
