[
https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431156#comment-16431156
]
Andrei Budnik commented on MESOS-8489:
--------------------------------------
[https://reviews.apache.org/r/66404/]
[https://reviews.apache.org/r/66474/]
>From `man 7 cgroups`:
{code:java}
freezer (since Linux 2.6.28; CONFIG_CGROUP_FREEZER)
The freezer cgroup can suspend and restore (resume) all tasks in
a cgroup. Freezing a cgroup /A also causes its children, for
example, tasks in /A/B, to be frozen.
{code}
I've double-checked this ^^ assertion by installing Ubuntu 9.04 which runs on
Linux kernel 2.6.28 and creating a nested freezer cgroup manually: !ubuntu
9.04.png!
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --------------------------------------------------------------
>
> Key: MESOS-8489
> URL: https://issues.apache.org/jira/browse/MESOS-8489
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Andrei Budnik
> Assignee: Andrei Budnik
> Priority: Major
> Labels: containerizer, flaky-test, mesosphere
> Attachments: ROOT_IsolatorFlags-badrun3.txt, ubuntu 9.04.png
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
> Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
> # Add {{::sleep(1);}} before
> [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
> "test" cgroup
> # recompile
> # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests
> --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first
> slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
> calling
> [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
> leads to calling
> [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
> to create a containerizer. An attempt to create a mesos c'zer, leads to
> calling
> [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
> Finally, we get to the point, where we try to create a ["test"
> container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
> So, the recovery process for the second slave [might
> detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
> this "test" container as an orphaned container.
> Thus, there is the race between recovery process for the first slave and an
> attempt to create a c'zer for the second agent.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)