[
https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593240#comment-15593240
]
Jie Yu commented on MESOS-6414:
-------------------------------
I think we need to revisit the whole cgroups destroy path, given that multiple
entities could be mutating the same cgroup.
I think it makes sense that the process launched by Mesos wants to manipulate
its own cgroup (e.g., sub-divide cgroups for tasks). However, I don't think it
makes sense to allow a process on the agent to manipulate the same cgroup
managed by Mesos. Even if Mesos supports that, the process on the agent might
not tolerate that.
If we keep that in mind, I think the correct sequence should be:
1) Try to kill all processes in the cgroup (including all nested cgroups). This
makes sure that the process that can manipulate nested cgroups goes away.
2) Try to remove all cgroups.
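To make the ordering concrete, here is a minimal sketch of that sequence in
Python. It is not the Mesos implementation: the helper names (nested_cgroups,
read_pids, kill_all, remove_all, destroy) are hypothetical, and it assumes each
cgroup directory exposes its member PIDs in a cgroup.procs file. A real destroy
path would also have to wait for the killed processes to exit before removing
the directories.

```python
import os
import signal


def nested_cgroups(root):
    """Return root and every nested cgroup directory, deepest first."""
    return [path for path, _dirs, _files in os.walk(root, topdown=False)]


def read_pids(cgroup):
    """Read the PIDs listed in cgroup.procs; empty if the cgroup is gone."""
    try:
        with open(os.path.join(cgroup, "cgroup.procs")) as f:
            return [int(line) for line in f if line.strip()]
    except FileNotFoundError:
        return []  # another entity already removed this cgroup


def kill_all(root, kill=os.kill):
    """Step 1: SIGKILL every process in root and its nested cgroups."""
    killed = []
    for cgroup in nested_cgroups(root):
        for pid in read_pids(cgroup):
            try:
                kill(pid, signal.SIGKILL)
                killed.append(pid)
            except ProcessLookupError:
                pass  # process exited on its own
    return killed


def remove_all(root):
    """Step 2: remove cgroups bottom-up, tolerating ones already removed."""
    for cgroup in nested_cgroups(root):
        try:
            os.rmdir(cgroup)
        except FileNotFoundError:
            pass  # already removed by another entity


def destroy(root, kill=os.kill):
    """Kill first, so nothing is left that can mutate nested cgroups."""
    kill_all(root, kill)
    remove_all(root)
```

Killing before removing means that whatever could create or remove nested
cgroups is gone by the time step 2 runs. Treating a missing directory as
already removed keeps the cleanup from failing if another entity got there
first.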
> cgroups isolator cleanup failed when the hierarchy is cleanup by docker
> daemon
> -------------------------------------------------------------------------------
>
> Key: MESOS-6414
> URL: https://issues.apache.org/jira/browse/MESOS-6414
> Project: Mesos
> Issue Type: Bug
> Components: cgroups
> Reporter: Anindya Sinha
> Assignee: Anindya Sinha
> Priority: Minor
> Labels: containerizer
> Fix For: 1.2.0
>
>
> Currently, if we launch a docker container in the Mesos containerizer, a race
> may occur between the docker daemon and the Mesos containerizer during cgroups
> operations.
> For example, when a docker container running in the Mesos containerizer exits
> due to OOM, the Mesos containerizer would destroy the following hierarchies:
> {code}
> /sys/fs/cgroup/freezer/mesos/<mesos-cgroup>/<docker-cgroup>
> /sys/fs/cgroup/freezer/mesos/<mesos-cgroup>
> {code}
> But the docker daemon may destroy
> {code}
> /sys/fs/cgroup/freezer/mesos/<mesos-cgroup>/<docker-cgroup>
> {code}
> at the same time.
> If the docker daemon destroys the hierarchy first, the Mesos containerizer
> fails during {{CgroupsIsolatorProcess::cleanup}} because it cannot find that
> hierarchy when destroying it.