Folks,

I would like to share some facts and thoughts about tests. When I ran `make
check -j7` on my Mac OS machine the other day, gtest reported the following
(your numbers may vary depending on the OS you're on and filters you use):
[==========] 882 tests from 117 test cases ran. (298610 ms total)

The same command for Mesos 0.21.1, which was released about a year ago,
yields
[==========] 452 tests from 71 test cases ran. (196398 ms total)

We almost doubled the number of tests in 2015. I think this is a great
achievement per se; moreover, it makes the life of cluster operators,
release managers, and Mesos contributors less stressful. I am going to have
an extra glass of champagne to celebrate this on the upcoming New Year's Eve
: ).

There are still some flaky tests left (and there always will be; failure
is embedded in progress), but it is not flakiness I would like to
discuss today. I would like to draw your attention to the last number in
the gtest output lines above: the total run time.

When adding tests, we also add to the time it takes for the complete
test suite to run. There are several ways to keep this number
small (one is, heh, to write fewer tests : ) ). Today I propose to focus on
reducing the duration of individual test cases.

Mesos tests are often built around certain sequences of events; some of
those have timeouts, some depend on other events. Naive test
implementations sometimes lead to a test being blocked for the duration of
some timeout, pointlessly slowing down the whole suite! A good indicator of
such a test is that its duration is an integral number of seconds (the
timeout) plus some delta (the actual testing code), for example 3123 ms or
5076 ms.
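The indicator above can be turned into a quick check. The following is only a
sketch: the function name and the 200 ms cut-off for the "delta" are my own
assumptions, not something Mesos or gtest defines, so tune the cut-off to
taste.

```python
def looks_timeout_bound(duration_ms, max_delta_ms=200):
    """Return True if the duration is at least 1 s and sits just above a
    whole number of seconds (hypothetical heuristic; 200 ms is an assumed
    cut-off for the delta spent in the actual testing code)."""
    seconds, delta = divmod(duration_ms, 1000)
    return seconds >= 1 and delta <= max_delta_ms


# Durations taken from this email, plus one made-up sub-second value (640).
for ms in (3123, 5076, 15483, 640):
    print(ms, looks_timeout_bound(ms))
```

A test flagged this way is only a suspect; it still has to be investigated to
see whether it really waits for a timeout.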

Suggestion: If you write a new test, please look at the test duration as
well. If it seems unreasonably long, investigate what the reasons are and
how you can make the test faster.

State of the art:
  * Slave recovery tests are known to be slow, see MESOS-733 [1].
  * Ben Mahler created an epic to track slow tests more than a year ago
(MESOS-1757 [2]) and did some work earlier (MESOS-297 [3]).
  * Dominic Hamon did pretty much what I have done (with a much nicer
command, too bad I noticed that after generating the list myself) and filed
MESOS-2059 [4].

To get a list of suspect tests I ran `./bin/mesos-tests.sh 2>/dev/null |
grep "ms)"` and noted down tests that took more than 1 second to complete.
To my knowledge, 1s is the shortest timeout we use in default values for
configurable parameters.
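For those who prefer not to eyeball the grep output, the filtering step could
be sketched roughly as below. It assumes the usual gtest per-test result
lines of the form `[       OK ] Suite.Test (N ms)`; the helper name
`slow_tests` and the sample test `SomeTest.Fast` are made up for
illustration, and failed parameterized tests with extra trailing text are
not handled.

```python
import re

# Matches gtest per-test result lines such as
# "[       OK ] MasterTest.OfferTimeout (1053 ms)".
# Per-suite summary lines ending in "ms total)" do not match.
RESULT_LINE = re.compile(r"^\[\s*(OK|FAILED)\s*\]\s+(\S+)\s+\((\d+) ms\)\s*$")


def slow_tests(lines, threshold_ms=1000):
    """Return (name, duration_ms) pairs above the threshold, slowest first."""
    found = []
    for line in lines:
        match = RESULT_LINE.match(line)
        if match and int(match.group(3)) > threshold_ms:
            found.append((match.group(2), int(match.group(3))))
    return sorted(found, key=lambda pair: pair[1], reverse=True)


sample = [
    "[ RUN      ] MasterTest.OfferTimeout",
    "[       OK ] MasterTest.OfferTimeout (1053 ms)",
    "[       OK ] SlaveTest.MetricsSlaveLaunchErrors (1009 ms)",
    "[       OK ] SomeTest.Fast (12 ms)",  # hypothetical fast test
    "[----------] 3 tests from Example (2074 ms total)",
]

for name, ms in slow_tests(sample):
    print(f"{ms:>6} ms  {name}")
```

To run it on real output, one option is to pipe the test run into a script
that passes `sys.stdin` to `slow_tests` instead of `sample`.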

For each test from the list I either created a JIRA ticket, or grouped a
bunch of seemingly related tickets into an epic (details below). I hijacked
MESOS-1757 [2] and made it a parent for all newly created epics and tickets.

I would like to encourage folks to look at these tickets and work on them
when they have the time and are in the mood. Apart from making `make check`
faster, I believe that most of these tickets are actually a very good way to
familiarize yourself with the Mesos codebase (hence I marked all tickets as
`newbie++`), so if you would like to contribute to Mesos but do not know
where to start, this can be a good choice!

Some tickets will surely turn out to be false positives, where there is a
good reason why a particular test takes longer than others. In this case a
comment explaining the reason is a proper resolution for the ticket.

To avoid difficulties with finding a shepherd, I would suggest
investigating the test first, understanding the reason for the slowness,
and updating the ticket, so that a potential shepherd can more easily
estimate the amount of time necessary for fixing the issue. Investigating
does not require a shepherd, and once it is done, all following steps
(finding a shepherd, submitting a patch, getting it committed) are trivial.

I believe some tests may share the same root cause (for example, they rely
on the same timeout, which cannot be changed from the test harness). In
this case all such tests can be fixed by a single change.

Below are the suspect tests.
  * Examples tests, slow since early days, see MESOS-297 [3]. Filed
MESOS-4155 [6].
  * Fetcher cache and fetcher cache http tests, filed MESOS-4156 [7].
  * ZooKeeper tests, some have been slow since early days, see MESOS-297 [3].
Filed MESOS-4157 [8].
  * Slave recovery tests. Known to be slow, see MESOS-733 [1] and MESOS-297
[3]. Filed MESOS-4158 [9].
  * Group tests, filed MESOS-4159 [10].
  * Recover tests, filed MESOS-4160 [11].

  * SlaveTest.CommandExecutorWithOverride (1311 ms), filed MESOS-4161 [12].
  * SlaveTest.MetricsSlaveLaunchErrors (1009 ms), filed MESOS-4162 [13].
  * SlaveTest.HTTPSchedulerSlaveRestart (2307 ms), filed MESOS-4163 [14].
  * MasterTest.RecoverResources (1018 ms), filed MESOS-4164 [15].
  * MasterTest.MasterInfoOnReElection (1024 ms), filed MESOS-4165 [16].
  * MasterTest.LaunchCombinedOfferTest (2023 ms), filed MESOS-4166 [17].
  * MasterTest.OfferTimeout (1053 ms), filed MESOS-4167 [18].
  * MasterAllocatorTest/0.SlaveLost (5076 ms). Allocator related test,
MESOS-3775 [5]. The test waits 5s for an executor to terminate.
  * MasterMaintenanceTest.EnterMaintenanceMode (5087 ms), filed MESOS-4168
[19].
  * MasterMaintenanceTest.InverseOffers (2027 ms), filed MESOS-4169 [20].
  * OversubscriptionTest.UpdateAllocatorOnSchedulerFailover (1018 ms),
filed MESOS-4170 [21].
  * OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover (1018 ms),
filed MESOS-4171 [22].
  * GarbageCollectorIntegrationTest.Restart (5102 ms), filed MESOS-4172
[23].
  * HealthCheckTest.CheckCommandTimeout (15483 ms), filed MESOS-4173 [24].
  * HookTest.VerifySlaveLaunchExecutorHook (5061 ms), filed MESOS-4174 [25].
  * ContentType/SchedulerTest.Decline/0 (1022 ms), filed MESOS-4175 [26].

Thanks for reading up to this point,
AlexR


[1] https://issues.apache.org/jira/browse/MESOS-733
[2] https://issues.apache.org/jira/browse/MESOS-1757
[3] https://issues.apache.org/jira/browse/MESOS-297
[4] https://issues.apache.org/jira/browse/MESOS-2059
[5] https://issues.apache.org/jira/browse/MESOS-3775
[6] https://issues.apache.org/jira/browse/MESOS-4155
[7] https://issues.apache.org/jira/browse/MESOS-4156
[8] https://issues.apache.org/jira/browse/MESOS-4157
[9] https://issues.apache.org/jira/browse/MESOS-4158
[10] https://issues.apache.org/jira/browse/MESOS-4159
[11] https://issues.apache.org/jira/browse/MESOS-4160
[12] https://issues.apache.org/jira/browse/MESOS-4161
[13] https://issues.apache.org/jira/browse/MESOS-4162
[14] https://issues.apache.org/jira/browse/MESOS-4163
[15] https://issues.apache.org/jira/browse/MESOS-4164
[16] https://issues.apache.org/jira/browse/MESOS-4165
[17] https://issues.apache.org/jira/browse/MESOS-4166
[18] https://issues.apache.org/jira/browse/MESOS-4167
[19] https://issues.apache.org/jira/browse/MESOS-4168
[20] https://issues.apache.org/jira/browse/MESOS-4169
[21] https://issues.apache.org/jira/browse/MESOS-4170
[22] https://issues.apache.org/jira/browse/MESOS-4171
[23] https://issues.apache.org/jira/browse/MESOS-4172
[24] https://issues.apache.org/jira/browse/MESOS-4173
[25] https://issues.apache.org/jira/browse/MESOS-4174
[26] https://issues.apache.org/jira/browse/MESOS-4175
