James DeFelice created MESOS-3363:
-------------------------------------

             Summary: custom executor's child process intermittently leaks to be a child of slave
                 Key: MESOS-3363
                 URL: https://issues.apache.org/jira/browse/MESOS-3363
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.23.0
         Environment: {code}
vagrant@node-1:~$ uname -a
Linux node-1 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
vagrant@node-1:~$ dpkg -l | grep -e mesos
ii  mesos                               0.23.0-1.0.ubuntu1404            amd64        Cluster resource manager with efficient resource isolation
{code}
            Reporter: James DeFelice


I was testing a custom executor implementation that manages the life cycle of 
multiple child processes. When the executor is SIGTERM'd it sends a SIGTERM to 
each child process and then self-terminates.

In some cases, the child processes do not die, even though the parent process 
(the custom executor) does. Instead the child procs are re-parented to the 
slave process where they continue to live on indefinitely.

My custom executor is written in Go, and I've found a useful Go/Linux-specific 
setting that allows me to configure a signal to be sent to child procs upon the 
death of the calling thread in the parent. (see 
https://golang.org/src/syscall/exec_linux.go?s=6285:6843#1 for details). I've 
since configured the custom executor to specify that a SIGKILL be sent to all 
child procs upon termination of the executor (parent) process: child procs are 
still sent a SIGTERM upon receipt of such by the executor, but the SIGKILL upon 
executor death now acts as a fallback.

Since implementing the above work-around I have not been able to reproduce the 
problem as previously described. This particular syscall is implemented in very 
few OSes (the Golang hack only supports Linux) so I'm not sure how I'd go about 
something similar on Windows, OS X, BSD, etc.

It seems like Mesos should take on the responsibility to ensure that when an 
executor is killed, all of its child procs are also eventually killed. Given 
that it's an intermittent and hard-to-reproduce problem, I'm assuming that 
Mesos *does* attempt to ensure executor child proc death, but that the 
implementation is racy/leaky.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)