Re: Slider stop not working

Chackravarthy Esakkimuthu Thu, 14 May 2015 00:35:06 -0700

Sure Gour, I would like to take up this task and contribute. Thanks for the
pointers to get started; I will get in touch with you in case I need any
more help.

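As a rough sketch of the Zookeeper marker idea from the pointers quoted
below: the AM creates a node at cluster create and deletes it only on an
explicit stop, and an agent that has lost the AM shuts down only if the node
is gone. Everything here is hypothetical (the MarkerStore class, the MARKER
path and the function names are made up for illustration, and an in-memory
set stands in for a real Zookeeper client), so it shows only the decision
logic, not Slider code.

```python
class MarkerStore:
    """In-memory stand-in for Zookeeper, for illustration only."""

    def __init__(self):
        self._nodes = set()

    def create(self, path):
        self._nodes.add(path)

    def delete(self, path):
        # discard() is a no-op if the node is already gone
        self._nodes.discard(path)

    def exists(self, path):
        return path in self._nodes


# Hypothetical marker path; not an actual Slider registry path.
MARKER = "/hypothetical/slider/storm_1/running"


def am_create_cluster(store):
    """AM side: create the marker at the beginning of create cluster."""
    store.create(MARKER)


def am_stop(store):
    """AM side: delete the marker only as part of an explicit stop.

    An AM failure or restart never calls this, so the marker survives.
    """
    store.delete(MARKER)


def agent_should_terminate(am_reachable, store):
    """Agent side: shut down only if the AM is gone AND the marker is missing."""
    if am_reachable:
        return False
    return not store.exists(MARKER)
```

With this split, an AM crash leaves the marker in place, so agents keep
running and wait for the AM to come back; only a deliberate stop removes it.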

Regarding kill -s TERM on the main.py processes (tried on the parent and the
child process independently), the results are as follows.

In none of the cases was the application killed.

*1) Slider app created, and it is running (not stopped)*

*1.1) kill 'bash main.py' process*

   - it killed both the 'bash main.py' process and its 'child main.py' process
   - but the application process (nimbus) is still running


SliderAgent.log :

*INFO 2015-05-14 12:17:06,591 Controller.py:497 - Component states
(result): Expected: 4 and Actual: 5*
*ERROR 2015-05-14 12:17:06,596 Controller.py:289 - Got terminateAgent
command*
*INFO 2015-05-14 12:17:16,597 Controller.py:217 - Terminate agent command
received from AM, stopping the agent ...*
*INFO 2015-05-14 12:17:16,597 ProcessHelper.py:39 - Removing pid file*
*WARNING 2015-05-14 12:17:16,598 ProcessHelper.py:44 - Unable to remove pid
file: [Errno 2] No such file or directory:
'/grid/4/hadoop/yarn/local/usercache/yarn/appcache/application_1431424102217_0003/container_1431424102217_0003_01_000002/infra/run/agent.pid'*
*INFO 2015-05-14 12:17:16,598 ProcessHelper.py:46 - Removing temp files*


*1.2) kill 'child main.py' process*

   - it also killed both the 'bash main.py' process and its 'child main.py'
   process
   - but the application process (nimbus) is still running


SliderAgent.log :

*INFO 2015-05-14 12:25:37,990 main.py:56 - signal received, exiting.*
*INFO 2015-05-14 12:25:37,990 ProcessHelper.py:39 - Removing pid file*
*INFO 2015-05-14 12:25:37,990 ProcessHelper.py:46 - Removing temp files*


*2) Slider app created, and it is stopped.*

*2.1) kill 'bash main.py' process*

   - it killed only the 'bash main.py' process, not the 'child main.py' process
   - the application process (nimbus) is still running
   - *no logs appeared in SliderAgent.log*
   - the container logs had already been completely cleared by the time this
   action was performed

*2.2) kill 'child main.py' process*

   - it killed both the 'bash main.py' process and its 'child main.py' process
   - the application process (nimbus) is still running
   - the container logs had already been completely cleared by the time this
   action was performed


SliderAgent.log :

*INFO 2015-05-14 12:48:25,589 main.py:56 - signal received, exiting.*
*INFO 2015-05-14 12:48:25,589 ProcessHelper.py:39 - Removing pid file*
*INFO 2015-05-14 12:48:25,590 ProcessHelper.py:46 - Removing temp files*

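In every case above the agent process goes away on SIGTERM while nimbus keeps
running. One possible way for an agent to take its components down with it is
to start each component in its own process group and signal the whole group
on shutdown. This is only a sketch under that assumption: start_component and
stop_component are hypothetical helpers, not functions from the Slider agent,
and how the real agent launches components may differ.

```python
import os
import signal
import subprocess


def start_component(cmd):
    """Launch a component in its own session, and so its own process group.

    With start_new_session=True the child's process group id equals its pid,
    so the whole tree it spawns can be signalled together later.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def stop_component(proc, timeout=5.0):
    """Send SIGTERM to the component's process group, escalating to SIGKILL.

    Returns the exit code of the component's top-level process.
    """
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        # Group already gone; just collect the exit status.
        return proc.wait()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # No response to SIGTERM within the timeout: force the group down.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        return proc.wait()
```

An agent could call stop_component for each component it started from the
handler that currently logs "signal received, exiting.", before removing its
pid and temp files.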

On Thu, May 14, 2015 at 1:38 AM, Gour Saha <[email protected]> wrote:

> Hi Chackra,
>
> You are absolutely right. The workaround that I was planning to work on,
> should be implemented as a neat backup solution, when YARN fails to
> shutdown containers (in this and certain other possible scenarios).
>
> In fact, we had filed a bug a long time back along the same lines,
> predicting this issue (for another scenario) -
> https://issues.apache.org/jira/browse/SLIDER-479
>
> As you had expressed interest to contribute to Slider, I was thinking if
> you would have some cycles and be willing to take this up. You can work on
> the develop branch and use SLIDER-479. Slider develop branch is compatible
> with HDP 2.2, so we can easily test the fix in your cluster.
>
> Let me know, and I can help all along the way.
>
> In case you have some cycles, here are some pointers that might help you
> to approach this problem -
>
> 1. Slider has a notion of sending a terminate command to the agent which
> the agent obeys and gracefully brings itself down
> 2. In this scenario, since the Slider AM goes down, the agents can look for
> a node in Zookeeper (when they lose connection with the AM) and shut
> themselves down if the node is missing (using the terminate code path or
> something more elegant)
> 3. Of course this Zookeeper node needs to be created by Slider AM in the
> beginning of create cluster and then deleted just before the AM shuts down
> as part of the stop command (might have to look into YARN pre-emption
> scenario, but we can ignore this for now). We do not want to delete this in
> AM failure/restart scenario.
> 4. Any other better ideas or elegant solution you can think of
>
> On a side note, can you test this in debian 7 -
> Go to one of the nodes where any of the agents are running (say NIMBUS or
> any other component) and then issue a SIGTERM to the main.py process (kill
> -s TERM <pid>). What do you see in the slider-agent.log after that? What
> happens to all the processes in this container? Are they still running?
>
> The <pid> is that of the bash main.py process (not the python main.py
> child process).
>
> So if the process is something like this -
> yarn      6007  6003  0 19:43 ?        00:00:00 /bin/bash -c python
> ./infra/agent/slider-agent/agent/main.py --label
> container_1431413628146_0003_01_000002___NIMBUS --zk-quorum
> c6408.ambari.apache.org:2181 --zk-reg-path
> /registry/users/yarn/services/org-apache-slider/storm_1 >
> /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out
> 2>&1
>
> You need to issue -
> kill -s TERM 6007
>
> -Gour
>
> On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <[email protected]> wrote:
>
> Thanks for your response steve,
>
> I was thinking that SliderAgent would receive 'stop' command from SliderAM
> to kill the components spawned by those agents. And yeah this might be
> specific to debian installation as others in the group are not facing this
> issue.
>
> On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <[email protected]>
> wrote:
>
>
> > On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu <[email protected]>
> > wrote:
> >
> > Starting a new thread,
> >
> > a JIRA has already been filed for this by Gour,
> > https://issues.apache.org/jira/browse/YARN-3561
> >
> > Slider stop does not stop the components started by slider, instead it
> > stops only SliderAM, and even SliderAgents did not receive 'stop' command.
> > (it happens with debian 7) and tested with 0.70.1 as well as 'develop'
> > branch code.
> >
> > Today I just came across the following mail archive,
> >
> >
>
> http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%[email protected]%3E
> >
> > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> >
> > *What is not implemented is an explicit call to "stop function in the
> > python scripts".
> >
> > What I was referring to that an attempt is made by the Agent to call
> > stop in the python script
> > but it is not guaranteed. The reason it is not guaranteed is that the
> > call to stop() and kill
> > of the containers by YARN is not co-ordinated.
> >
> > In summary, the ability to call stop() functions in the python script
> > is not implemented.
> > Its in the plan though.*
> >
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >
> > Does this still exist?
>
>
> the idea of the stop() command is to actually offer a best-effort clean
> shutdown for containers. Currently the AM just directly tells YARN to
> destroy a container. The agent doesn't get told, nor does the application
> (that's implicit from the agent).
>
> YARN is expected to "kill" then, if there is no response, "kill -9" the
> agent process. Which it does for the hosts we test on, linux, OSX and
> windows.
>
> If something is up with your YARN+debian installation, we believe that it
> is related to whether those container kill events are coming out from the
> node manager.
>
>
>
