Hey Zhao,

Yes, this is expected behavior. SIGKILL'ing the NMs will result in all of their container processes being leaked. This is really an issue with the way Linux handles orphaned child processes: it simply re-parents them to PID 1 and allows them to continue executing. I did some brief exploration of this here:
http://riccomini.name/posts/linux/2012-09-25-kill-subprocesses-linux-bash/

At LinkedIn, we do several things:

1. Soft kill (SIGTERM) the NMs, to allow the NMs to properly shut down all of their containers.
2. Before deploying an NM, we verify that there are no existing "container_*" processes running with a PPID of 1.

You could also verify that *all* container_* processes are dead after SIGKILL'ing an NM, if you want to make extra sure that you haven't leaked containers (which could lead to double-writing messages in Samza); a rough sketch of such a check follows the quoted message below.

In practice, once we implemented (1) above, we haven't seen any leaked containers. In a case where an NM dies unexpectedly (e.g. the JVM segfaults, or something), you have to go and clean up the leaked processes yourself.

Cheers,
Chris

On 1/16/15 5:20 AM, "Zhao Weinan" <[email protected]> wrote:

>Hi,
>
>We are running some Samza tasks on Hadoop YARN 2.4.1. For some reason, we
>restarted the whole cluster by SIGKILLing the RMs and NMs, with the Samza
>tasks left running. We then found that the Samza tasks survived the
>SIGKILL and restart, which made it troublesome for us to locate the task
>processes across the cluster. Is that expected?
>
>Thanks!
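For reference, here is a minimal sketch of the check described in (2), under the assumption that leaked container JVMs show up with "container_" somewhere in their command line and a PPID of 1 once the NM is gone; the pattern and the /proc parsing are illustrative, not anything the NM provides itself:

#!/usr/bin/env python
# Rough sketch: scan /proc for processes whose command line contains
# "container_" and whose parent PID is 1, i.e. container JVMs that were
# re-parented to init after their NM died. Adjust the "container_" pattern
# to match how containers are launched in your environment.
import os


def find_leaked_containers():
    leaked = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/cmdline' % pid) as f:
                cmdline = f.read().replace('\0', ' ').strip()
            with open('/proc/%s/stat' % pid) as f:
                stat = f.read()
        except (IOError, OSError):
            continue  # the process exited while we were scanning
        # /proc/<pid>/stat is "pid (comm) state ppid ...". Parse from the
        # closing paren so spaces in comm don't throw the field count off.
        ppid = stat.rsplit(')', 1)[1].split()[1]
        if 'container_' in cmdline and ppid == '1':
            leaked.append((pid, cmdline))
    return leaked


if __name__ == '__main__':
    leaks = find_leaked_containers()
    for pid, cmdline in leaks:
        print('leaked container: pid=%s cmd=%s' % (pid, cmdline))
    # Non-zero exit lets a deploy script abort if anything leaked.
    raise SystemExit(1 if leaks else 0)

You can fold this into a deploy script: run it before starting a new NM and abort (or clean up) if it reports anything.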
