Hi Chackra,

You are absolutely right. The workaround I was planning to work on should be implemented as a clean backup solution for when YARN fails to shut down containers (in this and certain other possible scenarios).
In fact, we had filed a bug a long time back along the same lines, predicting this issue (for another scenario) - https://issues.apache.org/jira/browse/SLIDER-479

Since you had expressed interest in contributing to Slider, I was wondering if you would have some cycles and be willing to take this up. You can work on the develop branch and use SLIDER-479. The Slider develop branch is compatible with HDP 2.2, so we can easily test the fix in your cluster. Let me know, and I can help all along the way.

In case you have some cycles, here are some pointers that might help you approach this problem -

1. Slider has a notion of sending a terminate command to the agent, which the agent obeys and gracefully brings itself down.
2. In this scenario, since the Slider AM goes down, the agents can look for a node in Zookeeper (when they lose the connection with the AM) and shut themselves down if the node is missing (using the terminate code path or something more elegant).
3. Of course, this Zookeeper node needs to be created by the Slider AM at the beginning of cluster creation and then deleted just before the AM shuts down as part of the stop command (we might have to look into the YARN pre-emption scenario, but we can ignore this for now). We do not want to delete this node in the AM failure/restart scenario.
4. Any other better ideas or more elegant solutions you can think of.

On a side note, can you test this on Debian 7 - go to one of the nodes where any of the agents is running (say NIMBUS or any other component) and issue a SIGTERM to the main.py process (kill -s TERM <pid>). What do you see in slider-agent.log after that? What happens to all the processes in this container? Are they still running?

The <pid> is that of the bash main.py process (not the python main.py child process). So if the process is something like this -

yarn 6007 6003 0 19:43 ? 00:00:00 /bin/bash -c python ./infra/agent/slider-agent/agent/main.py --label container_1431413628146_0003_01_000002___NIMBUS --zk-quorum c6408.ambari.apache.org:2181 --zk-reg-path /registry/users/yarn/services/org-apache-slider/storm_1 > /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out 2>&1

you need to issue -

kill -s TERM 6007

-Gour

On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <[email protected]<mailto:[email protected]>> wrote:

Thanks for your response, Steve. I was thinking that the Slider agent would receive a 'stop' command from the Slider AM to kill the components spawned by those agents. And yes, this might be specific to the Debian installation, as others in the group are not facing this issue.

On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <[email protected]<mailto:[email protected]>> wrote:

> On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu <[email protected]<mailto:[email protected]>> wrote:
>
> Starting a new thread,
>
> A JIRA has already been filed for the same by Gour:
> https://issues.apache.org/jira/browse/YARN-3561
>
> Slider stop does not stop the components started by Slider; instead it
> stops only the Slider AM, and even the Slider agents did not receive a
> 'stop' command. (This happens with Debian 7.) Tested with 0.70.1 as well
> as the 'develop' branch code.
>
> Today I just came across the following mail archive:
>
> http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%[email protected]%3E
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
> *What is not implemented is an explicit call to the "stop function in the
> python scripts".
>
> What I was referring to is that an attempt is made by the Agent to call
> stop in the python script, but it is not guaranteed. The reason it is not
> guaranteed is that the call to stop() and the kill of the containers by
> YARN are not co-ordinated.
>
> In summary, the ability to call the stop() functions in the python script
> is not implemented.
> It's in the plan though.*
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> Does this still exist?

The idea of the stop() command is to actually offer a best-effort clean shutdown for containers. Currently the AM just directly tells YARN to destroy a container. The agent doesn't get told, nor does the application (that's implicit from the agent). YARN is expected to "kill" and then, if there is no response, "kill -9" the agent process. Which it does for the hosts we test on: Linux, OS X and Windows. If something is up with your YARN+Debian installation, we believe it is related to whether those container kill events are coming out from the node manager.
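The Zookeeper-based shutdown protocol in pointers 1-3 earlier in the thread could be sketched roughly as below. This is only an illustration of the idea, not actual Slider code: the node path, the class and function names, and the FakeZK client are all hypothetical. FakeZK stands in for a real Zookeeper client (e.g. kazoo's KazooClient, which exposes similar exists/create/delete calls).

```python
# Hypothetical sketch of the ZK-based shutdown handshake. All names here
# (LIVE_NODE, SliderAMNodeLifecycle, agent_should_terminate, FakeZK) are
# illustrative assumptions, not real Slider APIs.

LIVE_NODE = "/slider/clusters/{cluster}/am-live"


class FakeZK:
    """Minimal in-memory stand-in for a Zookeeper client (e.g. kazoo)."""

    def __init__(self):
        self.nodes = set()

    def exists(self, path):
        return path in self.nodes

    def create(self, path, makepath=False):
        self.nodes.add(path)

    def delete(self, path):
        self.nodes.discard(path)


class SliderAMNodeLifecycle:
    """AM side: create a persistent liveness node at cluster create, and
    delete it only on a clean 'stop' -- never on AM failure/restart, so a
    restarting AM finds the node still present."""

    def __init__(self, zk, cluster):
        self.zk = zk
        self.path = LIVE_NODE.format(cluster=cluster)

    def on_create_cluster(self):
        # Persistent (not ephemeral): the node must survive AM restarts.
        if not self.zk.exists(self.path):
            self.zk.create(self.path, makepath=True)

    def on_stop_command(self):
        # Deleted just before the AM shuts down as part of 'stop'.
        if self.zk.exists(self.path):
            self.zk.delete(self.path)


def agent_should_terminate(zk, cluster):
    """Agent side: after losing its connection to the AM, check the
    liveness node; a missing node means the cluster was stopped, so the
    agent should take its graceful-terminate code path."""
    return not zk.exists(LIVE_NODE.format(cluster=cluster))


# Example lifecycle using the in-memory fake:
zk = FakeZK()
am = SliderAMNodeLifecycle(zk, "storm_1")
am.on_create_cluster()                              # at cluster create
keep_running = not agent_should_terminate(zk, "storm_1")  # AM alive
am.on_stop_command()                                # just before AM exits
must_stop = agent_should_terminate(zk, "storm_1")   # node gone: shut down
```

An agent that loses its AM connection but finds the node present would keep retrying (covering the AM failure/restart case), while a missing node triggers the existing terminate code path.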

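The SIGTERM experiment suggested above can be reproduced in miniature without a Slider cluster. The sketch below is a stand-in, not the real container: 'sleep 300; true' plays the role of the python main.py child behind the /bin/bash -c wrapper. It TERMs the wrapper (the analogue of kill -s TERM 6007) and then checks whether the child survives; on Linux, an untrapped SIGTERM kills the non-interactive bash wrapper but is not forwarded to its running foreground child, which is exactly the orphaned-process behavior the question is probing.

```python
import os
import signal
import subprocess
import time

# Launch a wrapper shell running a long-lived child, mimicking the
# container command:
#   /bin/bash -c 'python .../main.py ... > slider-agent.out 2>&1'
# The trailing 'true' forces bash to fork 'sleep' as a real child
# instead of exec'ing it.
wrapper = subprocess.Popen(["/bin/bash", "-c", "sleep 300; true"])
time.sleep(0.5)

# Find the child of the wrapper (the analogue of the python main.py
# child process).
child_pid = int(subprocess.check_output(
    ["pgrep", "-P", str(wrapper.pid)]).split()[0])

# TERM the bash wrapper, as in: kill -s TERM <pid-of-bash-main.py>
os.kill(wrapper.pid, signal.SIGTERM)
wrapper.wait()
time.sleep(0.5)

# Did the orphaned child survive its parent's termination?
try:
    os.kill(child_pid, 0)   # signal 0: existence check only
    child_alive = True
except ProcessLookupError:
    child_alive = False

print("child still running after TERM to wrapper:", child_alive)

# Clean up the orphan so the experiment leaves nothing behind.
if child_alive:
    os.kill(child_pid, signal.SIGKILL)
```

If the same survival shows up for the agent's child processes on the Debian 7 nodes, that would support the theory that the containers' processes are being left behind rather than killed.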