[
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243729#comment-16243729
]
Andrei Budnik commented on MESOS-7506:
--------------------------------------
*Second cause*
{{[ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/default_executor_tests.cpp#L1912]}}
launches a task group, so each task is launched via {{ComposingContainerizer}}.
When this test completes (after receiving the TASK_FINISHED status update), the
Slave destructor is called, where [it
waits|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]
for each container's [termination
future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/mesos/containerizer.cpp#L2528] to be set.
As this test uses {{ComposingContainerizer}}, [calling
destroy|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L572]
for a container means that {{ComposingContainerizer}} subscribes to the same
[container's termination
future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/composing.cpp#L638-L647]
via the {{onAny}} method. Once this future is set, a lambda function is
dispatched. This lambda removes the {{containerId}} from the hash set of containers.
When a container's termination future [is
set|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1524],
then
{{[AWAIT(wait)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]}}
might [be
satisfied|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L83],
so the hash set of containers will be [requested
(dispatched)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L577].
There is a race between the thread that sets the container's termination
future, calling {{onReadyCallbacks}} and then {{onAnyCallbacks}} (the latter
dispatches the aforementioned lambda), and the test thread, which waits for
the container's termination future and then calls
{{containerizer->containers()}}. If the test thread runs first, the
{{containerId}} is still in the hash set and the cleanup check fails.
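The two possible orderings can be sketched with a small single-threaded model. This is only an illustration under stated assumptions, not libprocess or Mesos code: {{TerminationFuture}}, {{betweenSteps}} and {{observedContainers}} are hypothetical names, and the "test thread" is modelled as a hook that runs either between marking the future ready and running the {{onAny}} callbacks, or after them.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a future; it only models the order of steps.
struct TerminationFuture {
  bool ready = false;
  std::vector<std::function<void()>> onAnyCallbacks;

  // `betweenSteps` models whatever another thread manages to do after the
  // state becomes ready but before the onAny callbacks are run.
  void set(const std::function<void()>& betweenSteps) {
    ready = true;
    betweenSteps();
    for (const auto& callback : onAnyCallbacks) {
      callback();
    }
  }
};

// Returns how many containers the modelled test thread observes.
std::size_t observedContainers(bool testThreadWinsRace) {
  std::set<std::string> containers = {"container-1"};

  TerminationFuture termination;
  // Models the lambda that removes the container id from the hash set.
  termination.onAnyCallbacks.push_back(
      [&containers]() { containers.erase("container-1"); });

  std::size_t observed = 0;
  // Models the test thread: the wait is already satisfied, so it reads
  // the set of containers.
  std::function<void()> check = [&]() {
    assert(termination.ready);
    observed = containers.size();
  };
  std::function<void()> noop = []() {};

  termination.set(testThreadWinsRace ? check : noop);
  if (!testThreadWinsRace) {
    check();  // The test thread runs only after the callbacks finished.
  }
  return observed;
}
```

In this model {{observedContainers(true)}} returns 1, which corresponds to the orphan container reported by the failing check, and {{observedContainers(false)}} returns 0.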
To reproduce this case, add a sleep of ~10ms before
[internal::run(copy->onAnyCallbacks,
*this)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1537]
and remove the sleep from
[process::internal::await|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L92].
> Multiple tests leave orphan containers.
> ---------------------------------------
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
> Reporter: Alexander Rukletsov
> Assignee: Andrei Budnik
> Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt,
> ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
> Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}