Thanks Chris

2015-01-21 1:12 GMT+08:00 Chris Riccomini <[email protected]>:
> Hey Zhao,
>
> Yes, I believe you're correct. I've never explicitly tested this failure
> case, but what you describe is what should happen.
>
> Cheers,
> Chris
>
> On 1/16/15 11:27 PM, "Zhao Weinan" <[email protected]> wrote:
>
> >Hi Chris,
> >
> >Thanks for replying, your explanation is pretty clear and useful. Next
> >time we'll use SIGTERM and check for orphaned child processes.
> >
> >I did some digging into the source code and found that
> >org.apache.samza.job.yarn.SamzaAppMaster will exit on a heartbeat error
> >with the RM. So if we SIGKILL the RM and all the NMs, the SamzaAppMaster
> >will exit by itself, leaving the Samza task containers unaware of what
> >happened. Am I right?
> >
> >2015-01-17 3:49 GMT+08:00 Chris Riccomini
> ><[email protected]>:
> >
> >> Hey Zhao,
> >>
> >> Yes, this is expected behavior. SIGKILL'ing NMs will result in all
> >> processes being leaked. This is really an issue with the way Linux
> >> handles orphaned child processes. It currently just changes the PPID
> >> to 1, and allows the process to continue executing. I did some brief
> >> exploration of this here:
> >>
> >> http://riccomini.name/posts/linux/2012-09-25-kill-subprocesses-linux-bash/
> >>
> >> At LinkedIn, we do several things:
> >>
> >> 1. Soft kill (SIGTERM) the NMs, to allow the NMs to properly shut down
> >>    all containers.
> >> 2. Before deploying an NM, we verify that there are no existing
> >>    processes with "container_*" running with a PPID of 1.
> >>
> >> You could also verify that *all* container_* processes are dead after
> >> SIGKILL'ing the NM, if you want to make extra sure that you haven't
> >> leaked containers (which could lead to double-writing messages in
> >> Samza).
> >>
> >> In practice, once we implemented (1), above, we haven't seen any leaked
> >> containers. In a case where an NM dies unexpectedly (e.g. the JVM
> >> segfaults, or something) you have to go and clean up the leaked
> >> processes.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On 1/16/15 5:20 AM, "Zhao Weinan" <[email protected]> wrote:
> >>
> >> >Hi,
> >> >
> >> >We are running some Samza tasks on Hadoop YARN 2.4.1. For some reason,
> >> >we restarted the whole cluster by SIGKILL'ing the RMs and NMs while
> >> >Samza tasks were still running. We then found that the Samza tasks
> >> >survived the SIGKILL and the restart, which made it hard for us to
> >> >locate the task processes across the cluster. Is that expected?
> >> >
> >> >Thanks!
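
For anyone finding this thread later, the pre-deploy check described above (no "container_*" processes running with a PPID of 1) could be sketched roughly as follows. This is only an illustration, not what LinkedIn actually runs: the function names `find_orphaned_containers` and `list_processes` are made up for this example, and it assumes a Linux host where `ps -eo pid,ppid,args` is available and where orphaned processes are reparented to PID 1, as described in the thread.

```python
import subprocess


def find_orphaned_containers(ps_output, marker="container_"):
    """Return (pid, command) pairs for processes whose PPID is 1 and whose
    command line contains `marker`.

    `ps_output` is expected to look like the output of
    `ps -eo pid,ppid,args`: one process per line, PID and PPID first.
    """
    orphans = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, ppid, rest of the command line
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # header line or something we cannot parse
        pid, ppid, args = int(parts[0]), int(parts[1]), parts[2]
        if ppid == 1 and marker in args:
            orphans.append((pid, args))
    return orphans


def list_processes():
    """Capture the process table (assumes a procps-style `ps` on Linux)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    leaked = find_orphaned_containers(list_processes())
    for pid, args in leaked:
        print(f"orphaned container process {pid}: {args}")
    # A non-zero exit status lets a deploy script refuse to start the NM.
    raise SystemExit(1 if leaked else 0)
```

Run on each node before starting the NM; any output means a container survived an earlier kill and should be cleaned up by hand first. Keeping the parsing separate from the `ps` call makes the check easy to try against saved `ps` output.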
