[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

Andrei Budnik (JIRA) Wed, 18 Oct 2017 08:33:47 -0700

    [ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209541#comment-16209541 ]


Andrei Budnik commented on MESOS-7506:
--------------------------------------

All failing tests show the same error message in their logs, for example:
{{E0922 00:38:40.509032 31034 slave.cpp:5398] Termination of executor '1' of 
framework 83bd1613-70d9-4c3e-b490-4aa60dd26e22-0000 failed: Failed to kill all 
processes in the container: Timed out after 1mins}}

The container termination future is triggered by 
[MesosContainerizerProcess::___destroy|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/slave/containerizer/mesos/containerizer.cpp#L2361].
 The agent subscribes to this future by calling 
[containerizer->wait()|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/slave/slave.cpp#L5280].
 Triggering this future invokes {{Slave::executorTerminated}}, 
which sends a {{TASK_FAILED}} status update.

A typical test (e.g. {{SlaveTest.ShutdownUnregisteredExecutor}}) waits for the 
dispatch of {{MesosContainerizerProcess::destroy}}:
{code}
  // Ensure that the slave times out and kills the executor.
  Future<Nothing> destroyExecutor =
    FUTURE_DISPATCH(_, &MesosContainerizerProcess::destroy);
{code}

After that, the test waits for the {{TASK_FAILED}} status update. So the test 
body completes successfully and the slave's destructor is called, [which 
fails|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/tests/cluster.cpp#L580],
 because {{MesosContainerizerProcess::___destroy}} doesn't erase the container 
from the hashmap in this failure path.
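
To make the sequence above concrete, here is a minimal, self-contained sketch 
of the bookkeeping problem. It is a toy model, not Mesos code: 
{{ToyContainerizer}}, {{KillOutcome}}, {{launch}}, {{destroy}} and {{empty}} 
are hypothetical names, and the real containerizer is asynchronous and far 
more involved. The only behaviour it is meant to mirror is the one described 
in this comment: a failed termination is reported to the caller, but the 
container entry stays in the map, so a later {{empty()}} check fails.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical outcome of trying to kill all processes in a container.
enum class KillOutcome { Killed, TimedOut };

// Toy stand-in for the containerizer's container map (not the Mesos API).
class ToyContainerizer {
public:
  void launch(const std::string& id) { containers_.insert(id); }

  // Returns true if termination succeeded. On a timeout the failure is
  // reported to the caller, but the entry is deliberately left in the
  // set, which models the orphan container described in this comment.
  bool destroy(const std::string& id, KillOutcome outcome) {
    assert(containers_.count(id) == 1);  // only known containers
    if (outcome == KillOutcome::TimedOut) {
      return false;  // termination failed; the container is not erased
    }
    containers_.erase(id);
    return true;
  }

  // Analogue of the containers->empty() check done on teardown.
  bool empty() const { return containers_.empty(); }

private:
  std::unordered_set<std::string> containers_;
};
```

In this model, calling {{destroy("c1", KillOutcome::TimedOut)}} after 
{{launch("c1")}} returns false and leaves {{empty()}} false, which corresponds 
to the teardown check failing even though the test already observed the 
failure it was waiting for.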

> Multiple tests leave orphan containers.
> ---------------------------------------
>
>                 Key: MESOS-7506
>                 URL: https://issues.apache.org/jira/browse/MESOS-7506
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
