[ 
https://issues.apache.org/jira/browse/MESOS-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163607#comment-14163607
 ] 

Alexander Rukletsov commented on MESOS-1871:
--------------------------------------------

It looks like this issue consists of two parts.

1. If {{CommandExecutor}} starts a task via {{sh -c}}, we reap the "wrong" 
process. Instead of reaping {{sh -c}}, it makes sense to monitor and reap the 
actual task process, or the whole process tree rooted at {{sh -c}}, i.e. call 
{{reaped()}} only when all processes in the tree have terminated. Otherwise, as 
illustrated by the test in the description, {{reaped()}} happily disables 
escalation, leaving the task process orphaned in the system.

2. If we do manage to enter the {{escalated()}} callback, we should ensure that 
all children of {{sh -c}} receive {{SIGKILL}}. I'm not sure the current 
implementation via {{os::killtree}} provides such a guarantee.
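
A simplified illustration of part 1, as a Python sketch rather than the Mesos 
code or the test from the description: the wrapper exits with status 0 while a 
process it started is still running. The command string and the names 
{{group_alive}} and {{wrapper_exit_vs_tree}} are made up for this example, and 
it assumes a POSIX system where a non-interactive {{sh}} keeps background jobs 
in its own process group.

```python
import os
import signal
import subprocess


def group_alive(pgid):
    """Return True if any process in the group still exists."""
    try:
        os.killpg(pgid, 0)  # signal 0 only checks for existence
        return True
    except ProcessLookupError:
        return False


def wrapper_exit_vs_tree():
    # start_new_session=True makes the wrapper a process group leader,
    # so its pid doubles as the process group id.
    proc = subprocess.Popen(
        ["sh", "-c", "sleep 5 & exit 0"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
    wrapper_status = proc.wait()  # reaps only the sh -c wrapper
    task_still_running = group_alive(proc.pid)
    try:
        os.killpg(proc.pid, signal.SIGKILL)  # clean up the leftover sleep
    except ProcessLookupError:
        pass
    return wrapper_status, task_still_running
```

Here {{wrapper_exit_vs_tree()}} should report a successful wrapper exit while 
the group check still sees the background {{sleep}}, which is why waiting on 
the wrapper alone is not enough.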

As proposed by [~idownes], one solution might be to use POSIX process groups 
and reap the whole group. However, it would still be nice to obtain the OS pid 
of the task process, so that status updates carry messages related to the task 
process and not to the wrapper {{sh -c}}.
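
To make the process group idea concrete, here is a minimal Python sketch (not 
the Mesos implementation, and not a description of how {{os::killtree}} works). 
It assumes the wrapper is launched as the leader of its own process group, 
signals the whole group with {{SIGTERM}}, and escalates to {{SIGKILL}} if any 
member outlives a grace period. The names {{signal_group}}, {{terminate_group}} 
and {{demo}}, the timeouts, and the test command are all hypothetical.

```python
import os
import signal
import subprocess
import time


def signal_group(pgid, sig):
    """Send sig to every process in the group; False if the group is gone."""
    try:
        os.killpg(pgid, sig)
        return True
    except ProcessLookupError:
        return False


def terminate_group(proc, grace=0.5, kill_wait=5.0, poll=0.05):
    """SIGTERM the group led by proc; SIGKILL it if it outlives `grace`.

    Returns True if escalation to SIGKILL was needed.
    """
    pgid = proc.pid  # valid because proc was started as a group leader
    signal_group(pgid, signal.SIGTERM)
    escalated = False
    deadline = time.monotonic() + grace
    while signal_group(pgid, 0):  # signal 0: existence check only
        proc.poll()  # reap the wrapper so it does not linger as a zombie
        if time.monotonic() >= deadline:
            if escalated:
                break  # stop waiting; the rest must be reaped elsewhere
            escalated = signal_group(pgid, signal.SIGKILL)
            deadline = time.monotonic() + kill_wait
        time.sleep(poll)
    proc.wait()
    return escalated


def demo():
    # The wrapper ignores SIGTERM, and sleep inherits the ignored disposition.
    proc = subprocess.Popen(
        ["sh", "-c", "trap '' TERM; echo ready; sleep 30"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        start_new_session=True,
    )
    proc.stdout.readline()  # wait until the trap is installed
    escalated = terminate_group(proc)
    proc.stdout.close()
    return escalated, proc.returncode
```

Note that this only helps with signalling and waiting for the whole tree. It 
does not identify which member of the group is the actual task process, so the 
pid reported in status updates would still need to be obtained separately.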

> Sending SIGTERM to a task command may render it orphaned
> --------------------------------------------------------
>
>                 Key: MESOS-1871
>                 URL: https://issues.apache.org/jira/browse/MESOS-1871
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Alexander Rukletsov
>
> {{CommandExecutor}} launches tasks by wrapping them in {{sh -c}}. That means 
> signals are sent to the top process, that is {{sh -c}}, and not to the task 
> directly. Though {{SIGTERM}} is propagated by {{sh -c}} down the process 
> tree, if the task is unresponsive to {{SIGTERM}}, {{sh -c}} terminates 
> reporting success to the {{CommandExecutor}}, rendering the task detached 
> from the parent process and still running. Because the {{CommandExecutor}} 
> thinks the command terminated normally, its OS process exits normally and may 
> not trigger the containerizer's escalation, which destroys cgroups.
> Here is the test related to the first part: 
> [https://gist.github.com/rukletsov/68259dfb02421813f9e6].
> Here is the test related to the second part: 
> [https://gist.github.com/rukletsov/3f19ecc7389fa51e65c0].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
