[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250352#comment-16250352
 ] 

Alexander Rukletsov commented on MESOS-7506:
--------------------------------------------

{noformat}
Commit: a595bcbf7afd9783e15f3e32cd9e70fa979531df [a595bcb]
Author: Andrei Budnik <[email protected]>
Date: 13 November 2017 at 21:59:17 GMT+1
Committer: Alexander Rukletsov <[email protected]>
Commit Date: 13 November 2017 at 23:15:43 GMT+1

Fixed bug in tests leading to orphaned containers.

Previously, some tests advanced the clock in a loop until a task
status update was sent, while the task's container was being
destroyed. Container destruction consists of multiple steps, some of
which have a timeout, e.g. `cgroups::DESTROY_TIMEOUT`. So there was a
race between the container destruction process and the loop that
advanced the clock, with two possible outcomes:

  (1) The container is destroyed before the advancing clock reaches
      the timeout.

  (2) The timeout is triggered by the advancing clock before
      container destruction completes. This leaves orphaned
      containers, which are detected by the Slave destructor in
      `tests/cluster.cpp`, so the test fails.

This change removes the loop and resumes the clock after advancing
it once.

Review: https://reviews.apache.org/r/63589/
{noformat}
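
For illustration only, the race described in the commit message can be modeled with a toy virtual clock. This is a minimal sketch and not code from the patch: `FakeClock`, `Destroy`, `loopUntilDone` and `advanceOnce` are hypothetical names, the numbers are arbitrary, and it assumes that once the clock is resumed the remaining real time is negligible compared to the timeout.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for a pausable test clock (not the libprocess API).
class FakeClock {
public:
  // Registers a callback to run once virtual time reaches `deadline`.
  void timer(long deadline, std::function<void()> f) {
    timers.emplace(deadline, std::move(f));
  }

  // Moves virtual time forward and fires every timer that became due.
  void advance(long secs) {
    assert(secs >= 0);
    now += secs;
    while (!timers.empty() && timers.begin()->first <= now) {
      std::function<void()> f = std::move(timers.begin()->second);
      timers.erase(timers.begin());
      f();
    }
  }

private:
  long now = 0;
  std::multimap<long, std::function<void()>> timers;
};

enum class Outcome { Destroyed, Orphaned };

// Models a container destruction that needs `remainingWork` real steps.
struct Destroy {
  int remainingWork;
  bool timedOut = false;

  bool done() const { return remainingWork == 0; }
  void step() { if (!done() && !timedOut) --remainingWork; }
};

// Old pattern: keep advancing the clock while destruction is in flight.
Outcome loopUntilDone(int work, long advanceBy, long timeout) {
  FakeClock clock;
  Destroy d{work};
  clock.timer(timeout, [&d] { if (!d.done()) d.timedOut = true; });
  while (!d.done() && !d.timedOut) {
    clock.advance(advanceBy);  // virtual time jumps ahead...
    d.step();                  // ...while real work progresses one step
  }
  return d.timedOut ? Outcome::Orphaned : Outcome::Destroyed;
}

// New pattern: advance once, then let destruction finish undisturbed.
Outcome advanceOnce(int work, long advanceBy, long timeout) {
  FakeClock clock;
  Destroy d{work};
  clock.timer(timeout, [&d] { if (!d.done()) d.timedOut = true; });
  clock.advance(advanceBy);    // single advance, below the timeout
  // Clock "resumed": no further virtual jumps (assumption, see above).
  while (!d.done() && !d.timedOut) {
    d.step();
  }
  return d.timedOut ? Outcome::Orphaned : Outcome::Destroyed;
}
```

In this model, with a timeout of 60 and an advance of 20, a destruction needing 2 steps survives the loop (outcome 1), one needing 5 steps times out inside the loop (outcome 2), and the single-advance variant completes in both cases.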

> Multiple tests leave orphan containers.
> ---------------------------------------
>
>                 Key: MESOS-7506
>                 URL: https://issues.apache.org/jira/browse/MESOS-7506
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: containerizer, flaky-test, mesosphere
>         Attachments: KillMultipleTasks-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
