[https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741]
Greg Mann edited comment on MESOS-4025 at 12/10/15 2:09 AM:
------------------------------------------------------------
I just observed this error on Ubuntu 14.04 when running the tests as root.
It does indeed seem to be due to artifacts left behind by some tests. After
running the entire test suite the first time, I saw that several of the
{{SlaveRecoveryTest}} tests had failed:
{code}
[==========] 988 tests from 144 test cases ran. (590989 ms total)
[  PASSED  ] 980 tests.
[  FAILED  ] 8 tests, listed below:
[  FAILED  ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveTest.PingTimeoutNoPings
[  FAILED  ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
[  FAILED  ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
{code}
and the {{SlaveRecoveryTest}} errors all looked the same:
{code}
[ RUN      ] SlaveRecoveryTest/0.ShutdownSlave
../../src/tests/mesos.cpp:906: Failure
(cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy
-----------------------------------------------------------
We're very sorry but we can't seem to destroy existing
cgroups that we likely created as part of an earlier
invocation of the tests. Please manually destroy the cgroup
at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first
manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks'
-----------------------------------------------------------
../../src/tests/mesos.cpp:940: Failure
(cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy
[  FAILED  ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms)
{code}
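For anyone who hits this, the manual cleanup the message asks for can be scripted. A minimal sketch, assuming a leftover cgroups-v1 directory at the path from the log above (the path and the plain SIGKILL approach are just illustrations; run as root):
{code}
# Hypothetical leftover cgroup path, copied from the failure above.
CGROUP=/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave

# Kill every process still attached to the cgroup.
for pid in $(cat "$CGROUP/tasks"); do
  kill -KILL "$pid"
done

# A cgroup directory can only be removed once it has no tasks and no
# child cgroups, so rmdir the deepest directories first.
find "$CGROUP" -depth -type d -exec rmdir {} \;
{code}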
Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoveryTest*"
bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At
this point, the failures were reliably reproduced 100% of the time:
{code}
[ RUN      ] SlaveRecoveryTest/0.MultipleSlaves
../../src/tests/mesos.cpp:906: Failure
(cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy
-----------------------------------------------------------
We're very sorry but we can't seem to destroy existing
cgroups that we likely created as part of an earlier
invocation of the tests. Please manually destroy the cgroup
at '/sys/fs/cgroup/perf_event/mesos_test' by first
manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks'
-----------------------------------------------------------
../../src/tests/mesos.cpp:940: Failure
(cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy
[  FAILED  ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms)
{code}
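Before re-running, it's easy to check whether a previous run left any test cgroups behind. A small sketch, assuming cgroups v1 mounted under {{/sys/fs/cgroup}} as in the errors above:
{code}
# List leftover mesos_test* cgroups across all mounted hierarchies,
# together with the PIDs still attached to each of them.
for dir in /sys/fs/cgroup/*/mesos_test*; do
  [ -d "$dir" ] || continue
  echo "$dir:"
  cat "$dir/tasks"
done
{code}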
Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}}
tests, doing:
{code}
GTEST_FILTER="" make check
sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh
{code}
and they all passed. Regarding [~nfnt]'s comment above on the
{{HealthCheckTest}} tests: after all this, I was able to run {{sudo
GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo
GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the
{{SlaveRecoveryTest}} tests passed, so perhaps it isn't an artifact of the
{{HealthCheckTest}} tests that's causing this problem? However, I only did
that a couple of times, and I didn't try the exact command that Jan did:
{code}
sudo ./bin/mesos-tests.sh --gtest_repeat=1 --gtest_break_on_failure --gtest_filter="*ROOT_DOCKER_DockerHealthStatusChange:SlaveRecoveryTest*GCExecutor"
{code}
so if it's a flaky thing I may not have caught it.
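To rule that out more conclusively, one could loop the same filter until it breaks; a sketch reusing the gtest flags already used in this report ({{--gtest_repeat=-1}} repeats indefinitely):
{code}
# Repeat the filtered tests until the first failure, then stop so the
# leftover cgroup state can be inspected.
sudo ./bin/mesos-tests.sh --gtest_repeat=-1 --gtest_break_on_failure \
  --gtest_filter="*ROOT_DOCKER_DockerHealthStatusChange:SlaveRecoveryTest*GCExecutor"
{code}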
> SlaveRecoveryTest/0.GCExecutor is flaky.
> ----------------------------------------
>
> Key: MESOS-4025
> URL: https://issues.apache.org/jira/browse/MESOS-4025
> Project: Mesos
> Issue Type: Bug
> Components: test
> Affects Versions: 0.26.0
> Reporter: Till Toenshoff
> Assignee: Jan Schlicht
> Labels: flaky, flaky-test, test
>
> The build was SSL-enabled (--enable-ssl, --enable-libevent) and based on 0.26.0-rc1.
> The test suite was run as root.
> {noformat}
> sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1
> {noformat}
> {noformat}
> [ RUN      ] SlaveRecoveryTest/0.GCExecutor
> I1130 16:49:16.336833 1032 exec.cpp:136] Version: 0.26.0
> I1130 16:49:16.345212 1049 exec.cpp:210] Executor registered on slave dde9fd4e-b016-4a99-9081-b047e9df9afa-S0
> Registered executor on ubuntu14
> Starting task 22c63bba-cbf8-46fd-b23a-5409d69e4114
> sh -c 'sleep 1000'
> Forked command at 1057
> ../../src/tests/mesos.cpp:779: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave': Device or resource busy
> *** Aborted at 1448902156 (unix time) try "date -d @1448902156" if you are using GNU date ***
> PC: @ 0x1443e9a testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 27364 (TID 0x7f1bfdd2b800) from PID 0; stack trace: ***
> @ 0x7f1be92b80b7 os::Linux::chained_handler()
> @ 0x7f1be92bc219 JVM_handle_linux_signal
> @ 0x7f1bf7bbc340 (unknown)
> @ 0x1443e9a testing::UnitTest::AddTestPartResult()
> @ 0x1438b99 testing::internal::AssertHelper::operator=()
> @ 0xf0b3bb mesos::internal::tests::ContainerizerTest<>::TearDown()
> @ 0x1461882 testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x145c6f8 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x143de4a testing::Test::Run()
> @ 0x143e584 testing::TestInfo::Run()
> @ 0x143ebca testing::TestCase::Run()
> @ 0x1445312 testing::internal::UnitTestImpl::RunAllTests()
> @ 0x14624a7 testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x145d26e testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x14440ae testing::UnitTest::Run()
> @ 0xd15cd4 RUN_ALL_TESTS()
> @ 0xd158c1 main
> @ 0x7f1bf7808ec5 (unknown)
> @ 0x913009 (unknown)
> My Vagrantfile generator:
> {noformat}
> #!/usr/bin/env bash
>
> cat << EOF > Vagrantfile
> # -*- mode: ruby -*-
> # vi: set ft=ruby :
>
> Vagrant.configure(2) do |config|
>   # Disable shared folder to prevent certain kernel module dependencies.
>   config.vm.synced_folder ".", "/vagrant", disabled: true
>
>   config.vm.box = "bento/ubuntu-14.04"
>   config.vm.hostname = "${PLATFORM_NAME}"
>
>   config.vm.provider "virtualbox" do |vb|
>     vb.memory = ${VAGRANT_MEM}
>     vb.cpus = ${VAGRANT_CPUS}
>     vb.customize ["modifyvm", :id, "--nictype1", "virtio"]
>     vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
>     vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
>   end
>
>   config.vm.provider "vmware_fusion" do |vb|
>     vb.memory = ${VAGRANT_MEM}
>     vb.cpus = ${VAGRANT_CPUS}
>   end
>
>   config.vm.provision "file", source: "../test.sh", destination: "~/test.sh"
>
>   config.vm.provision "shell", inline: <<-SHELL
>     sudo apt-get update
>     sudo apt-get -y install openjdk-7-jdk autoconf libtool
>     sudo apt-get -y install build-essential python-dev python-boto \
>       libcurl4-nss-dev libsasl2-dev maven \
>       libapr1-dev libsvn-dev libssl-dev libevent-dev
>     sudo apt-get -y install git
>     sudo wget -qO- https://get.docker.com/ | sh
>   SHELL
> end
> EOF
> {noformat}
> The problem kicks in frequently in my tests - I'd say in more than 10% of runs, but less than 50%.