[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos

Anindya Sinha (JIRA) Wed, 19 Oct 2016 10:57:03 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589394#comment-15589394
 ]


Anindya Sinha commented on MESOS-6414:
--------------------------------------

Assume a task is launched that creates a sub-cgroup through an external 
service, so the cgroup hierarchy looks something like:
/sys/fs/cgroup/freezer/mesos/<mesos-cgroup>/<sub-cgroup>

Say the task fails, so the container exits. When launcher->destroy() is 
called, we do a recursive cgroups::get() to collect all cgroups, which returns 
absolute paths for both <mesos-cgroup> and <sub-cgroup>. A TasksKiller() is 
then initiated for each of them, so freeze(), thaw(), etc. run for both in 
parallel, followed by a killed().

However, since the <sub-cgroup> is created by an external service, that service 
may clean up the <sub-cgroup> without Mesos' knowledge. If the external service 
removes /sys/fs/cgroup/freezer/mesos/<mesos-cgroup>/<sub-cgroup> before Mesos 
gets to it in TasksKiller(), any of the cleanup operations (freeze(), thaw(), 
etc.) for the <sub-cgroup> may fail. As a result, we bail out of the cleanup of 
<mesos-cgroup> at that point, which seems incorrect since all of the cleanup 
has actually happened.

To avoid this issue (i.e. the race between the external service and Mesos to 
clean up the <sub-cgroup>), I suggest that we keep treating a failure in any of 
these steps as a failure, except when the step failed because the <sub-cgroup> 
no longer exists. In that case it has already been cleaned up by the external 
service, so we treat it as a success.
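
To make the proposal concrete, here is a minimal, self-contained sketch of the 
policy. It is not the Mesos code: StepResult and cleanupCgroup() are 
hypothetical names invented for illustration, and each cleanup step (freeze, 
thaw, remove, ...) is modeled as a callable that reports an errno-style code. 
The assumption is that a missing cgroup shows up as ENOENT.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <vector>

// Hypothetical result of one cleanup step (freeze, thaw, remove, ...).
// `error` is an errno-style code; 0 means the step succeeded.
struct StepResult {
  int error;
};

// Hypothetical helper: runs the cleanup steps for one cgroup in order.
// Returns true if the cgroup can be considered cleaned up.
bool cleanupCgroup(const std::vector<std::function<StepResult()>>& steps) {
  for (const auto& step : steps) {
    const StepResult result = step();

    if (result.error == ENOENT) {
      // The cgroup no longer exists, e.g. the external service already
      // removed it. Nothing is left to do, so report success.
      return true;
    }

    if (result.error != 0) {
      // Any other error is still a genuine cleanup failure.
      return false;
    }
  }

  return true;
}
```

With this policy, a sequence such as "freeze succeeds, thaw reports ENOENT" 
counts as a successful cleanup of the <sub-cgroup>, while an error such as 
EBUSY still fails the cleanup as before.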



> Task cleanup fails when the containers includes cgroups not owned by Mesos
> --------------------------------------------------------------------------
>
>                 Key: MESOS-6414
>                 URL: https://issues.apache.org/jira/browse/MESOS-6414
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>            Priority: Minor
>
> If a Mesos task creates a cgroup outside of the context of Mesos, Mesos is 
> unaware of that cgroup created in the task context.
> Now when the Mesos task terminates, Mesos tries to clean up all cgroups within 
> the top level cgroup it knows about. If the cgroup created in the task 
> context exists when LinuxLauncherProcess::destroy() is called but is 
> eventually cleaned up by the container before we do a freeze() or thaw() or 
> remove(), Mesos fails at those stages, leading to an incomplete cleanup of 
> the container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
