Thanks, Chris.

2015-01-21 1:12 GMT+08:00 Chris Riccomini <[email protected]>:

> Hey Zhao,
>
> Yes, I believe you're correct. I've never explicitly tested this failure
> case, but what you describe is what should happen.
>
> Cheers,
> Chris
>
> On 1/16/15 11:27 PM, "Zhao Weinan" <[email protected]> wrote:
>
> >Hi Chris,
> >
> >Thanks for replying, your explanation is pretty clear and useful. Next
> >time we'll use SIGTERM and check for orphaned child processes.
> >
> >I did some digging into the source code and found that
> >org.apache.samza.job.yarn.SamzaAppMaster will exit on a heartbeat error
> >with the RM. So if we SIGKILL the RM and all the NMs, the SamzaAppMaster
> >will exit by itself, leaving the Samza task containers unaware of what
> >happened. Am I right?
> >
> >
> >
> >2015-01-17 3:49 GMT+08:00 Chris Riccomini
> ><[email protected]>:
> >
> >> Hey Zhao,
> >>
> >> Yes, this is expected behavior. SIGKILL'ing NMs will result in all
> >> processes being leaked. This is really an issue with the way Linux
> >> handles orphaned child processes. It currently just changes the PPID
> >> to 1, and allows the process to continue executing. I did some brief
> >> exploration of this here:
> >>
> >> http://riccomini.name/posts/linux/2012-09-25-kill-subprocesses-linux-bash/
> >>
> >> At LinkedIn, we do several things:
> >>
> >> 1. Soft kill (SIGTERM) the NMs, to allow the NMs to properly shut down
> >> all containers.
> >> 2. Before deploying an NM, we verify that there are no existing
> >> processes with "container_*" running with a PPID of 1.
> >>
> >> You could also verify that *all* container_* processes are dead after
> >> SIGKILL'ing the NM, if you want to make extra sure that you haven't
> >> leaked containers (which could lead to double-writing messages in
> >> Samza).
> >>
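
The PPID check described above (no "container_*" processes left with a parent PID of 1) could be sketched roughly as follows. This is only an illustration, not what LinkedIn actually runs: the helper names `find_orphaned_containers` and `list_processes` are made up here, and it assumes a Linux host whose `ps` supports `-eo pid,ppid,args` and that leaked containers carry "container_" somewhere in their command line.

```python
import subprocess


def find_orphaned_containers(ps_output, marker="container_"):
    """Return (pid, command) pairs for processes reparented to PID 1
    whose command line contains `marker`.

    `ps_output` is the text printed by `ps -eo pid,ppid,args`.
    """
    orphans = []
    for line in ps_output.splitlines():
        # Columns: PID, PPID, then the full command line (may contain spaces).
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, ppid, args = parts
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # skips the header row
        if int(ppid) == 1 and marker in args:
            orphans.append((int(pid), args))
    return orphans


def list_processes():
    """Run ps and return its output (assumes a procps-style ps on Linux)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running `find_orphaned_containers(list_processes())` on a node before starting the NM would return an empty list when nothing has leaked. To check that *all* containers are dead after a SIGKILL, the same loop could drop the PPID condition and match on the marker alone.
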
> >> In practice, since we implemented (1) above, we haven't seen any leaked
> >> containers. In a case where an NM dies unexpectedly (e.g. the JVM
> >> segfaults, or something) you have to go and clean up the leaked
> >> processes.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On 1/16/15 5:20 AM, "Zhao Weinan" <[email protected]> wrote:
> >>
> >> >Hi,
> >> >
> >> >We are running some Samza tasks on Hadoop YARN 2.4.1. For some reason,
> >> >we restarted the whole cluster by SIGKILL'ing the RMs and NMs, with the
> >> >Samza tasks still running. We then found that the Samza tasks survived
> >> >the SIGKILL and restart, which made it hard for us to locate the task
> >> >processes across the cluster. Is that expected?
> >> >
> >> >Thanks!
> >>
> >>
>
>
