Re: Slider stop not working

Chackravarthy Esakkimuthu Mon, 18 May 2015 01:30:22 -0700

Sure Gour, I have already filed an issue
<https://issues.apache.org/jira/browse/HADOOP-11989> for this in
hadoop-common and will submit a patch. Thanks for your help in debugging
this issue.
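
For anyone skimming the thread, here is a minimal sketch of the change described in the quoted message below. It is written in Python purely for illustration; the real change is in Shell.java, and the function and parameter names here are hypothetical, not Hadoop's API. The point is only that "--" ends option parsing, so the following "-<process_id>" is read as a negative pid (a process group) and not as another option. Whether every OS's kill accepts this form is still the open question.

```python
from typing import List


def build_kill_command(signal_no: int, pid: str, process_group: bool) -> List[str]:
    """Return the argv for signalling a process or its whole process group.

    Hypothetical helper for illustration; not the code in Shell.java.
    """
    if process_group:
        # "--" marks the end of options, so "-<pid>" is treated as a
        # negative pid (the process group) instead of an option.
        return ["kill", "-%d" % signal_no, "--", "-%s" % pid]
    return ["kill", "-%d" % signal_no, pid]


print(" ".join(build_kill_command(15, "6007", True)))  # kill -15 -- -6007
```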


On Mon, May 18, 2015 at 2:54 AM, Gour Saha <[email protected]> wrote:

> Chackra,
> This is wonderful. Thanks for debugging this and finding a solution.
>
> Now that we know it works for Debian, we have to check whether this modified
> syntax also works on other OSes. If it does, we can make a
> machine-independent change. Otherwise we have to make an OS-specific change.
>
> I would suggest you submit a patch for this change on hadoop trunk branch.
> You should get credit for it.
>
> -Gour
>
> > On May 17, 2015, at 11:05 AM, "Chackravarthy Esakkimuthu" <
> [email protected]> wrote:
> >
> > Gour/Steve,
> >
> > The issue was caused by DefaultContainerExecutor constructing the kill
> > command improperly, so SIGTERM was never actually delivered to the
> > SliderAgent; as a result, all the agents and their components kept running.
> >
> > I made one change in Shell.java (hadoop-common) so that the kill command
> > is constructed with two hyphens before the negative pid, and slider stop
> > now works properly :)
> >
> > It was:       *kill -signalNo -<process_id>*
> > Changed to:   *kill -signalNo -- -<process_id>*
> >
> > I have updated the JIRA with this as well:
> >
> > https://issues.apache.org/jira/browse/YARN-3561
> >
> >
> > Thanks,
> > Chackra
> >
> >
> > On Thu, May 14, 2015 at 1:03 PM, Chackravarthy Esakkimuthu <
> > [email protected]> wrote:
> >
> >> Sure Gour, I would like to take up this task and contribute. Thanks for
> >> the pointers to get started; I will get in touch with you in case I
> >> need any more help.
> >>
> >> Regarding kill -s TERM on the main.py processes (tried on the parent and
> >> the child process independently), the results are as follows.
> >>
> >> In none of the cases was the application killed.
> >>
> >> *1) Slider app created, and it is running (not stopped)*
> >>
> >> *1.1) kill 'bash main.py' process*
> >>
> >>   -  it killed both 'bash main.py' and its 'child main.py' process
> >>   -  but the application process (nimbus) is still running
> >>
> >>
> >> SliderAgent.log :
> >>
> >> *INFO 2015-05-14 12:17:06,591 Controller.py:497 - Component states
> >> (result): Expected: 4 and Actual: 5*
> >> *ERROR 2015-05-14 12:17:06,596 Controller.py:289 - Got terminateAgent
> >> command*
> >> *INFO 2015-05-14 12:17:16,597 Controller.py:217 - Terminate agent command
> >> received from AM, stopping the agent ...*
> >> *INFO 2015-05-14 12:17:16,597 ProcessHelper.py:39 - Removing pid file*
> >> *WARNING 2015-05-14 12:17:16,598 ProcessHelper.py:44 - Unable to remove
> >> pid file: [Errno 2] No such file or directory:
> >> '/grid/4/hadoop/yarn/local/usercache/yarn/appcache/application_1431424102217_0003/container_1431424102217_0003_01_000002/infra/run/agent.pid'*
> >> *INFO 2015-05-14 12:17:16,598 ProcessHelper.py:46 - Removing temp files*
> >>
> >> *1.2) kill 'child main.py' process*
> >>
> >>   - it also killed both 'bash main.py' and its 'child main.py' process
> >>   - but the application process (nimbus) is still running
> >>
> >>
> >> SliderAgent.log :
> >>
> >> *INFO 2015-05-14 12:25:37,990 main.py:56 - signal received, exiting.*
> >> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:39 - Removing pid file*
> >> *INFO 2015-05-14 12:25:37,990 ProcessHelper.py:46 - Removing temp files*
> >>
> >>
> >> *2) Slider app created, and it is stopped.*
> >>
> >> *2.1) kill 'bash main.py' process*
> >>
> >>   - it killed only the 'bash main.py' process, not the 'child main.py'
> >>   process
> >>   - the application process (nimbus) is still running
> >>   - *no logs appeared in SliderAgent*
> >>   - the container logs had been completely cleared by the time this
> >>   action was done
> >>
> >> *2.2) kill 'child main.py' process*
> >>
> >>   - it killed both 'bash main.py' and its 'child main.py' process
> >>   - the application process (nimbus) is still running
> >>   - the container logs had been completely cleared by the time this
> >>   action was done
> >>
> >>
> >> SliderAgent.log :
> >>
> >> *INFO 2015-05-14 12:48:25,589 main.py:56 - signal received, exiting.*
> >> *INFO 2015-05-14 12:48:25,589 ProcessHelper.py:39 - Removing pid file*
> >> *INFO 2015-05-14 12:48:25,590 ProcessHelper.py:46 - Removing temp files*
> >>
> >>
> >>> On Thu, May 14, 2015 at 1:38 AM, Gour Saha <[email protected]> wrote:
> >>>
> >>> Hi Chackra,
> >>>
> >>> You are absolutely right. The workaround that I was planning to work on
> >>> should be implemented as a neat backup solution for when YARN fails to
> >>> shut down containers (in this and certain other possible scenarios).
> >>>
> >>> In fact, we had filed a bug a long time back along the same lines,
> >>> predicting this issue (for another scenario) -
> >>> https://issues.apache.org/jira/browse/SLIDER-479
> >>>
> >>> As you had expressed interest in contributing to Slider, I was wondering
> >>> if you would have some cycles and be willing to take this up. You can
> >>> work on the develop branch and use SLIDER-479. The Slider develop branch
> >>> is compatible with HDP 2.2, so we can easily test the fix in your cluster.
> >>>
> >>> Let me know, and I can help all along the way.
> >>>
> >>> In case you have some cycles, here are some pointers that might help you
> >>> approach this problem -
> >>>
> >>> 1. Slider has a notion of sending a terminate command to the agent, which
> >>> the agent obeys and gracefully brings itself down.
> >>> 2. In this scenario, since the Slider AM goes down, the agents can look
> >>> for a node in Zookeeper (when they lose connection with the AM) and shut
> >>> themselves down if the node is missing (using the terminate code path or
> >>> something more elegant).
> >>> 3. Of course, this Zookeeper node needs to be created by the Slider AM at
> >>> the beginning of create cluster and then deleted just before the AM shuts
> >>> down as part of the stop command (we might have to look into the YARN
> >>> pre-emption scenario, but we can ignore this for now). We do not want to
> >>> delete this in the AM failure/restart scenario.
> >>> 4. Any other better ideas or elegant solutions you can think of.
> >>>
> >>> On a side note, can you test this on Debian 7 -
> >>> Go to one of the nodes where any of the agents is running (say NIMBUS or
> >>> any other component) and then issue a SIGTERM to the main.py process
> >>> (kill -s TERM <pid>). What do you see in the slider-agent.log after that?
> >>> What happens to all the processes in this container? Are they still
> >>> running?
> >>>
> >>> The <pid> is that of the bash main.py process (not the python main.py
> >>> child process).
> >>>
> >>> So if the process is something like this -
> >>> yarn      6007  6003  0 19:43 ?        00:00:00 /bin/bash -c python
> >>> ./infra/agent/slider-agent/agent/main.py --label
> >>> container_1431413628146_0003_01_000002___NIMBUS --zk-quorum
> >>> c6408.ambari.apache.org:2181 --zk-reg-path
> >>> /registry/users/yarn/services/org-apache-slider/storm_1 >
> >>> /hadoop/yarn/log/application_1431413628146_0003/container_1431413628146_0003_01_000002/slider-agent.out
> >>> 2>&1
> >>>
> >>> You need to issue -
> >>> kill -s TERM 6007
> >>>
> >>> -Gour
> >>>
> >>> On 5/13/15, 1:38 AM, "Chackravarthy Esakkimuthu" <[email protected]> wrote:
> >>>
> >>> Thanks for your response, Steve.
> >>>
> >>> I was thinking that SliderAgent would receive a 'stop' command from
> >>> SliderAM to kill the components spawned by those agents. And yes, this
> >>> might be specific to the Debian installation, as others in the group are
> >>> not facing this issue.
> >>>
> >>> On Tue, May 12, 2015 at 1:50 PM, Steve Loughran <[email protected]> wrote:
> >>>
> >>>
> >>>> On 12 May 2015, at 08:42, Chackravarthy Esakkimuthu <[email protected]> wrote:
> >>>>
> >>>> Starting a new thread.
> >>>>
> >>>> Gour has already filed a JIRA for this:
> >>>> https://issues.apache.org/jira/browse/YARN-3561
> >>>>
> >>>> Slider stop does not stop the components started by Slider; it stops
> >>>> only the SliderAM, and even the SliderAgents do not receive the 'stop'
> >>>> command. (This happens on Debian 7, tested with 0.70.1 as well as the
> >>>> 'develop' branch code.)
> >>>>
> >>>> Today I just came across the following mail archive,
> >>>>
> >>>> http://mail-archives.apache.org/mod_mbox/incubator-slider-dev/201503.mbox/%[email protected]%3E
> >>>>
> >>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> >>>>
> >>>> *What is not implemented is an explicit call to "stop function in the
> >>>> python scripts".
> >>>>
> >>>> What I was referring to that an attempt is made by the Agent to call
> >>>> stop in the python script
> >>>> but it is not guaranteed. The reason it is not guaranteed is that the
> >>>> call to stop() and kill
> >>>> of the containers by YARN is not co-ordinated.
> >>>>
> >>>> In summary, the ability to call stop() functions in the python script
> >>>> is not implemented.
> >>>> Its in the plan though.*
> >>>>
> >>>>
> >>>> Is this still the case?
> >>>
> >>>
> >>> The idea of a stop() command is to offer a best-effort clean
> >>> shutdown for containers. Currently the AM just directly tells YARN to
> >>> destroy a container. The agent doesn't get told, nor does the application
> >>> (that's implicit from the agent).
> >>>
> >>> YARN is expected to "kill" and then, if there is no response, "kill -9"
> >>> the agent process, which it does for the hosts we test on: Linux, OS X
> >>> and Windows.
> >>>
> >>> If something is up with your YARN+Debian installation, we believe that it
> >>> is related to whether those container kill events are coming out from the
> >>> node manager.
> >>
>
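
The "kill" then "kill -9" sequence described at the end of the quoted thread can be sketched roughly as follows. This is a minimal Python illustration of the general pattern, assuming a POSIX host; it is not YARN's node manager code, and the helper name and the grace period are hypothetical.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=2.0):
    """Ask a child to exit with SIGTERM; escalate to SIGKILL after a grace period.

    Hypothetical helper for illustration only, not YARN's implementation.
    """
    proc.terminate()  # sends SIGTERM on POSIX (the plain "kill" step)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX (the "kill -9" step)
        proc.wait()
        return "killed"


# A child that just sleeps and does not handle SIGTERM specially.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(stop_process(child))
```

Note that this signals a single process only. The problem in this thread was that the signal has to reach the whole process group of the container, which is where the "--" form of the kill command comes in.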
