[ 
https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388588#comment-15388588
 ] 

Avinash Sridharan commented on MESOS-5879:
------------------------------------------

This does seem odd. The only place we free up the net_cls handles is during the 
`cleanup` of the containers:
https://github.com/apache/mesos/blob/a61074586d778d432ba991701c9c4de9459db897/src/slave/containerizer/mesos/isolators/cgroups/net_cls.cpp#L643

The only explanation that comes to mind is that somehow the cgroups were not 
deleted but the handle got freed, and the handle then got allocated to a 
different container, which would lead to this issue. However, this hypothesis 
does seem a bit implausible.

Is it possible to dump the net_cls cgroups that you are observing on your 
system? That would confirm that there are in fact two containers with the same 
net_cls handle.
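
Something along these lines might work for the dump. It is only a minimal 
sketch, not part of Mesos: it assumes a cgroup v1 net_cls hierarchy mounted at 
/sys/fs/cgroup/net_cls with the agent's cgroups under a `mesos` root (adjust 
the path to match your mount point and `--cgroups_root`). The function names 
are made up for illustration. It reads each `net_cls.classid` file and reports 
any handle that shows up in more than one cgroup.

```python
import os
import sys
from collections import defaultdict


def find_duplicate_classids(root):
    """Return {classid: [cgroup paths]} for classids used by more than one cgroup."""
    by_handle = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        if "net_cls.classid" not in filenames:
            continue
        with open(os.path.join(dirpath, "net_cls.classid")) as f:
            text = f.read().strip()
        if not text:
            continue
        classid = int(text)  # the kernel exposes the classid as a decimal integer
        if classid == 0:
            continue  # 0 means no handle has been assigned to this cgroup
        by_handle[classid].append(dirpath)
    return {h: sorted(paths) for h, paths in by_handle.items() if len(paths) > 1}


def format_handle(classid):
    """Render a classid as major:minor (the kernel encodes it as 0xAAAABBBB)."""
    return "0x%04x:0x%04x" % (classid >> 16, classid & 0xFFFF)


if __name__ == "__main__":
    # Assumed default location; pass the real net_cls root as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/net_cls/mesos"
    for classid, paths in sorted(find_duplicate_classids(root).items()):
        print(format_handle(classid))
        for path in paths:
            print("  " + path)
```

If this prints nothing, no two cgroups under that root share a handle. Any 
handle it does print is listed with the cgroup paths that carry it.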

Also, could you dump the agent logs from before and after recovery (whichever 
are available)? They might give us some more hints about the problem.

> cgroups/net_cls isolator causing agent recovery issues
> ------------------------------------------------------
>
>                 Key: MESOS-5879
>                 URL: https://issues.apache.org/jira/browse/MESOS-5879
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups, isolation, slave
>            Reporter: Silas Snider
>            Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any 
> agent process in a cluster running an experimental custom isolator as well, 
> the agents are unable to recover from checkpoint, because net_cls reports 
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our 
> custom isolator), it's also a problem that the net_cls isolator fails 
> recovery just for duplicate handles in cgroups that it is literally about to 
> unconditionally destroy during recovery. Can this be fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
