Andrei Budnik created MESOS-8489:
------------------------------------
Summary: LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is
flaky
Key: MESOS-8489
URL: https://issues.apache.org/jira/browse/MESOS-8489
Project: Mesos
Issue Type: Bug
Components: containerization
Reporter: Andrei Budnik
Attachments: ROOT_IsolatorFlags-badrun3.txt
Observed this on internal Mesosphere CI.
{code:java}
../../src/tests/cluster.cpp:662: Failure
Value of: containers->empty()
Actual: false
Expected: true
Failed to destroy containers: { test }
{code}
h2. Steps to reproduce
# Add {{::sleep(1);}} before
[removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
the "test" cgroup.
# Recompile.
# Run `sudo GLOG_v=2 ./src/mesos-tests
--gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
--gtest_break_on_failure --gtest_repeat=10 --verbose`
h2. Race description
While recovery is in progress for [the first
slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
calling
[`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
leads to calling
[`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
to create a containerizer. Creating a Mesos containerizer leads to
calling
[`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
Finally, we get to the point where we try to create a ["test"
cgroup|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
So, the recovery process for the first slave [might
detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
this "test" cgroup as an orphaned container.
Thus, there is a race between the recovery process for the first slave and the
attempt to create a containerizer for the second agent.
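The interleaving described above can be illustrated with a small, single-threaded model. This is only a sketch and not Mesos code: {{Hierarchy}}, {{createTestCgroup}}, {{removeTestCgroup}}, {{findOrphans}}, {{racyInterleaving}} and {{benignInterleaving}} are hypothetical names, and the cgroup hierarchy is assumed to be a plain set of names shared by both agents. It replays the two possible orderings deterministically instead of using real threads.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the cgroup hierarchy that both agents see.
using Hierarchy = std::set<std::string>;

// Models the probe during containerizer creation: a temporary "test"
// cgroup is created and removed shortly afterwards.
void createTestCgroup(Hierarchy& hierarchy) { hierarchy.insert("test"); }
void removeTestCgroup(Hierarchy& hierarchy) { hierarchy.erase("test"); }

// Models launcher recovery: every cgroup that is not a known container
// is reported as an orphan.
std::vector<std::string> findOrphans(
    const Hierarchy& hierarchy,
    const std::set<std::string>& known)
{
  std::vector<std::string> orphans;
  for (const std::string& cgroup : hierarchy) {
    if (known.count(cgroup) == 0) {
      orphans.push_back(cgroup);
    }
  }
  return orphans;
}

// Recovery scans the hierarchy while the "test" cgroup still exists
// (the window that the added sleep widens).
std::vector<std::string> racyInterleaving()
{
  Hierarchy hierarchy;
  const std::set<std::string> known;  // No checkpointed containers.

  createTestCgroup(hierarchy);        // Second agent: probe starts.
  std::vector<std::string> orphans =
    findOrphans(hierarchy, known);    // First agent: recovery scan.
  removeTestCgroup(hierarchy);        // Second agent: probe ends.

  return orphans;
}

// Recovery scans only after the probe has cleaned up.
std::vector<std::string> benignInterleaving()
{
  Hierarchy hierarchy;
  const std::set<std::string> known;

  createTestCgroup(hierarchy);
  removeTestCgroup(hierarchy);

  return findOrphans(hierarchy, known);
}
```

In this model, {{racyInterleaving()}} returns a single orphan named "test", which corresponds to the {{Failed to destroy containers: \{ test \}}} message above, while {{benignInterleaving()}} returns no orphans.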