[ 
https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417999#comment-16417999
 ] 

Gilbert Song commented on MESOS-8489:
-------------------------------------

[~abudnik], thanks for the triage. However, I don't think we understand 
this issue deeply enough yet:
# The race description does not seem accurate to me. The race is between the 
destruction of the first cluster::Slave and the orphan container destroy in the 
second slave's recovery path. We should reset the Owned pointer before we 
call the next StartSlave(). (This would fix the flakiness in this unit test.)
# We need to understand why the nested *test* cgroup is still there when we 
create the first slave, since removing it is just a simple os::rmdir(). This 
is the trigger of the flakiness. The *test* cgroup is supposed to be created 
and removed immediately. There might be a bug in cgroup::remove(). 
https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L485
# The nested *test* cgroup may no longer be needed since it was a workaround 
for old kernel versions. Could you investigate whether nested cgroups are 
supported by kernel versions later than 2.6? We may be able to remove this 
code and document it (we still need to understand #2, though). 
https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L461~#L488
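
The ordering suggested in #1 can be illustrated with a small standalone sketch. It does not use the Mesos test harness: FakeSlave, Stats, and startSlave() are hypothetical stand-ins, and std::unique_ptr is assumed here as a rough substitute for the Owned pointer. The point is only that assigning a new object to an owning pointer constructs the new object before the old one is destroyed, so the two lifetimes overlap unless the pointer is reset first.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>

// Hypothetical bookkeeping: how many fake slaves are alive at once.
struct Stats
{
  int live = 0;
  int maxLive = 0;
};

// Hypothetical stand-in for cluster::Slave. The destructor is where the
// real harness would tear down containers.
class FakeSlave
{
public:
  explicit FakeSlave(Stats* stats) : stats_(stats)
  {
    stats_->live++;
    stats_->maxLive = std::max(stats_->maxLive, stats_->live);
  }

  ~FakeSlave()
  {
    assert(stats_->live > 0);
    stats_->live--;
  }

  FakeSlave(const FakeSlave&) = delete;
  FakeSlave& operator=(const FakeSlave&) = delete;

private:
  Stats* stats_;
};

// Hypothetical stand-in for StartSlave().
std::unique_ptr<FakeSlave> startSlave(Stats* stats)
{
  return std::make_unique<FakeSlave>(stats);
}

// Second slave is created while the first is still alive; the first is
// destroyed only when the assignment replaces it.
int restartWithoutReset()
{
  Stats stats;
  std::unique_ptr<FakeSlave> slave = startSlave(&stats);
  slave = startSlave(&stats);
  return stats.maxLive;
}

// First slave is destroyed explicitly before the second one is created.
int restartWithReset()
{
  Stats stats;
  std::unique_ptr<FakeSlave> slave = startSlave(&stats);
  slave.reset();
  slave = startSlave(&stats);
  return stats.maxLive;
}
```

In this sketch restartWithoutReset() observes two live instances at the same time, while restartWithReset() never sees more than one, which is the window the reset is meant to close.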

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --------------------------------------------------------------
>
>                 Key: MESOS-8489
>                 URL: https://issues.apache.org/jira/browse/MESOS-8489
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Andrei Budnik
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: containerizer, flaky-test, mesosphere
>         Attachments: ROOT_IsolatorFlags-badrun3.txt
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
>  # Add {{::sleep(1);}} before 
> [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
>  "test" cgroup
>  # recompile
>  # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
> --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags 
> --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first 
> slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
>  calling 
> [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
>  leads to calling 
> [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
>  to create a containerizer. An attempt to create a Mesos containerizer leads 
> to calling 
> [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
>  Finally, we get to the point where we try to create a ["test" 
> container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
>  So, the recovery process for the first slave [might 
> detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
>  this "test" container as an orphaned container.
> Thus, there is a race between the recovery process for the first slave and 
> the attempt to create a containerizer for the second agent.
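
As a rough illustration of why a transient "test" cgroup can be picked up as an orphan, the sketch below treats every cgroup that exists but is not among the known containers as an orphan. This is a simplified model, not the actual linux_launcher.cpp recovery code; findOrphans() and both of its parameters are hypothetical names introduced only for this example.

```cpp
#include <set>
#include <string>

// Simplified model of orphan detection during recovery (hypothetical
// helper): any cgroup present in the hierarchy that does not belong to
// a known container is reported as an orphan.
std::set<std::string> findOrphans(
    const std::set<std::string>& existingCgroups,
    const std::set<std::string>& knownContainers)
{
  std::set<std::string> orphans;

  for (const std::string& cgroup : existingCgroups) {
    if (knownContainers.count(cgroup) == 0) {
      orphans.insert(cgroup);
    }
  }

  return orphans;
}
```

Under this model, if the "test" cgroup happens to exist at the moment the hierarchy is listed, it ends up in the orphan set even though it was never a real container.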



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
