[ https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390373#comment-15390373 ]
Silas Snider commented on MESOS-5879:
-------------------------------------
We've found the cause. When we launch tasks, we create a child cgroup inside
each of the Mesos cgroups and run part of the task in it. When we do this in
net_cls, the agent fails to recover because it detects a child container with
the same classid as its parent, which is entirely valid in this case. We
didn't notice this earlier because our child container names look similar to
Mesos container names.
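
To make the failure mode concrete, here is a rough sketch of how a duplicate
classid check can trip over a nested cgroup, and how it might tolerate one.
This is not the Mesos net_cls isolator's code: the function names, the cgroup
paths, and the classid values are all made up for illustration, and the
"ignore nested" behaviour is only one possible approach.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def shares_classid_with_ancestor(path, classids):
    """True if some ancestor cgroup of `path` is known and has the same classid."""
    return any(
        classids.get(str(ancestor)) == classids[path]
        for ancestor in PurePosixPath(path).parents
    )


def find_duplicate_classids(classids, ignore_nested=False):
    """Group cgroup paths by classid and return only the handles seen more than once.

    `classids` maps a cgroup path (relative to the net_cls hierarchy root)
    to the integer read from that cgroup's net_cls.classid file.
    """
    by_handle = defaultdict(list)
    for path, classid in classids.items():
        # Hypothetical tolerance: a child that simply carries its parent's
        # classid is not counted as a second owner of the handle.
        if ignore_nested and shares_classid_with_ancestor(path, classids):
            continue
        by_handle[classid].append(path)
    return {h: sorted(paths) for h, paths in by_handle.items() if len(paths) > 1}


# Made-up layout: one container with a nested helper cgroup, one without.
observed = {
    "mesos/c1": 0x00010001,
    "mesos/c1/helper": 0x00010001,
    "mesos/c2": 0x00010002,
}

strict = find_duplicate_classids(observed)
tolerant = find_duplicate_classids(observed, ignore_nested=True)
```

With the strict check, the helper cgroup is reported as a second owner of the
parent's handle, which is roughly the situation described above. With
ignore_nested=True the same layout produces no duplicates.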
> cgroups/net_cls isolator causing agent recovery issues
> ------------------------------------------------------
>
> Key: MESOS-5879
> URL: https://issues.apache.org/jira/browse/MESOS-5879
> Project: Mesos
> Issue Type: Bug
> Components: cgroups, isolation, slave
> Reporter: Silas Snider
> Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any
> agent process in a cluster running an experimental custom isolator as well,
> the agents are unable to recover from checkpoint, because net_cls reports
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our
> custom isolator), it's also a problem that the net_cls isolator fails
> recovery just for duplicate handles in cgroups that it is literally about to
> unconditionally destroy during recovery. Can this be fixed?