AlexR, thanks for this great work! :-D It's nice to hear that so many tests
have been added in the past year, and I appreciate the list of tickets to
check out; I'll definitely take one on soon when I have some time.

I'd like to bring up something that both Neil and Joseph mentioned to me
recently, which could be of use when working on these slow test tickets.
Since we have the `process::Clock` class, it's quite easy to control the
clock manually, and doing so can both speed up tests and make them
more deterministic and less flaky. While we're working on the above tickets, I
think it would be nice to look for opportunities to alter the tests we're
touching to pause the clock and then advance it explicitly using `pause()`,
`settle()`, and `advance()`, rather than letting it run as usual.
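
To make the idea concrete, here is a minimal, self-contained sketch of why a
manually driven clock helps. `ManualClock`, `Outcome`, and
`runTimeoutScenario` are names I made up for illustration; this is not
libprocess code, and the real `process::Clock` may behave differently. The toy
clock has no `settle()` because its callbacks run synchronously inside
`advance()`.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for a pausable test clock. This is NOT libprocess'
// process::Clock; it only models "time moves when the test says so".
class ManualClock {
public:
  using Duration = std::chrono::milliseconds;

  // Current time on this clock, measured from its creation.
  Duration now() const { return now_; }

  // Register a callback to fire once `delay` has elapsed on this clock.
  void schedule(Duration delay, std::function<void()> callback) {
    timers_.emplace(now_ + delay, std::move(callback));
  }

  // Move time forward and fire every timer that became due, earliest first.
  void advance(Duration delta) {
    now_ += delta;
    while (!timers_.empty() && timers_.begin()->first <= now_) {
      std::function<void()> callback = std::move(timers_.begin()->second);
      timers_.erase(timers_.begin());
      callback();
    }
  }

private:
  Duration now_{0};
  std::multimap<Duration, std::function<void()>> timers_;
};

// What the scenario observed just before and just after the deadline.
struct Outcome {
  bool firedEarly;
  bool firedOnTime;
};

// Exercises a 5-second timeout without sleeping for 5 real seconds.
Outcome runTimeoutScenario() {
  ManualClock clock;
  bool timedOut = false;
  clock.schedule(std::chrono::seconds(5), [&timedOut] { timedOut = true; });

  // One millisecond short of the deadline: the timeout must not have fired.
  clock.advance(std::chrono::milliseconds(4999));
  const bool firedEarly = timedOut;

  // Crossing the deadline fires the callback immediately.
  clock.advance(std::chrono::milliseconds(1));
  return {firedEarly, timedOut};
}
```

The scenario covers a 5-second timeout in effectively zero wall-clock time,
and the result does not depend on how loaded the machine is, which is the same
benefit we'd hope to get from pausing and advancing the real clock in our
tests.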

Cheers,
Greg

On Wed, Dec 16, 2015 at 9:01 AM, tommy xiao <xia...@gmail.com> wrote:

> +1
>
> 2015-12-16 2:15 GMT+08:00 Alex Rukletsov <a...@mesosphere.com>:
>
> > Folks,
> >
> > I would like to share some facts and thoughts about tests. When I ran
> > `make check -j7` on my Mac OS machine the other day, gtest reported the
> > following (your numbers may vary depending on the OS you're on and filters
> > you use):
> > [==========] 882 tests from 117 test cases ran. (298610 ms total)
> >
> > The same command for Mesos 0.21.1, which was released about a year ago,
> > yields
> > [==========] 452 tests from 71 test cases ran. (196398 ms total)
> >
> > We almost doubled the number of tests in 2015. I think this is a great
> > achievement per se; moreover, it makes the life of cluster operators,
> > release managers, and Mesos contributors less stressful. I am going to
> > have an extra glass of champagne to celebrate this on the upcoming New
> > Year's Eve : ).
> >
> > There are still some flaky tests left (and there always will be; failure
> > is embedded into progress), but it is not the flakiness I would like to
> > discuss today. I would like to draw your attention to the last number in
> > the gtest output lines above.
> >
> > When adding tests, we also contribute to the time it takes for a complete
> > test suite to run. There are multiple ways to keep this number small (one
> > is, heh, to write fewer tests : ) ). Today I propose to focus on reducing
> > the duration of individual test cases.
> >
> > Mesos tests are often built around certain sequences of events; some of
> > those have timeouts, some depend on other events. Naive test
> > implementations sometimes lead to a test being blocked for the duration of
> > some timeout, pointlessly slowing down the whole suite! A good indicator
> > of such a test is that its duration is an integral number of seconds (the
> > timeout) plus some delta (actual testing code), for example 3123 ms or
> > 5076 ms.
> >
> > Suggestion: If you write a new test, please look at the test duration as
> > well. If it seems unreasonably long, investigate what the reasons are and
> > how you can make the test faster.
> >
> > State of the art:
> >   * Slave recovery tests are known to be slow, see MESOS-733 [1].
> >   * Ben Mahler created an epic to track slow tests more than a year ago
> > (MESOS-1757 [2]) and did some work earlier (MESOS-297 [3]).
> >   * Dominic Hamon did pretty much what I have done (with a much nicer
> > command, too bad I noticed that after generating the list myself) and
> > filed MESOS-2059 [4].
> >
> > To get a list of suspect tests I ran `./bin/mesos-tests.sh 2>/dev/null |
> > grep "ms)"` and noted down tests that took more than 1 second to
> > complete. To my knowledge, 1s is the shortest timeout we use in default
> > values for configurable parameters.
> >
> > For each test from the list I either created a JIRA ticket, or grouped a
> > bunch of seemingly related tickets into an epic (details below). I
> > hijacked MESOS-1757 [2] and made it a parent for all newly created epics
> > and tickets.
> >
> > I would like to encourage folks to look at these tickets and work on them
> > when they have the time and are in the mood. Apart from making `make check`
> > faster, I believe that most of these tickets are actually a very good way
> > to familiarize yourself with the Mesos codebase (hence I marked all
> > tickets as `newbie++`), so if you would like to contribute to Mesos but do
> > not know where to start, this can be a good choice!
> >
> > It is clear that some tickets are false positives and there is a good
> > reason why a particular test takes longer than others. In this case a
> > comment explaining the reason is a proper resolution for the ticket.
> >
> > To avoid difficulties with finding a shepherd, I would suggest
> > investigating the test first, understanding the reason for the slowness,
> > and updating the ticket, so that a potential shepherd can more easily estimate
> > the amount of time necessary for fixing the issue. Investigating does not
> > require a shepherd, and once it is done, all following steps (finding a
> > shepherd, submitting a patch, getting it committed) are trivial.
> >
> > I believe some tests may share the same root cause (for example, they
> > rely on the same timeout, which cannot be changed from the test harness).
> > In this case all such tests can be fixed by a single change.
> >
> > Below are the suspect tests.
> >   * Examples tests, slow since early days, see MESOS-297 [3]. Filed
> > MESOS-4155 [6].
> >   * Fetcher cache and fetcher cache http tests, filed MESOS-4156 [7].
> >   * Zookeeper tests, some are slow since early days, see MESOS-297 [3].
> > Filed MESOS-4157 [8].
> >   * Slave recovery tests. Known to be slow, see MESOS-733 [1] and
> > MESOS-297 [3]. Filed MESOS-4158 [9].
> >   * Group tests, filed MESOS-4159 [10].
> >   * Recover tests, filed MESOS-4160 [11].
> >
> >   * SlaveTest.CommandExecutorWithOverride (1311 ms), filed MESOS-4161
> > [12].
> >   * SlaveTest.MetricsSlaveLaunchErrors (1009 ms), filed MESOS-4162 [13].
> >   * SlaveTest.HTTPSchedulerSlaveRestart (2307 ms), filed MESOS-4163 [14].
> >   * MasterTest.RecoverResources (1018 ms), filed MESOS-4164 [15].
> >   * MasterTest.MasterInfoOnReElection (1024 ms), filed MESOS-4165 [16].
> >   * MasterTest.LaunchCombinedOfferTest (2023 ms), filed MESOS-4166 [17].
> >   * MasterTest.OfferTimeout (1053 ms), filed MESOS-4167 [18].
> >   * MasterAllocatorTest/0.SlaveLost (5076 ms). Allocator related test,
> > MESOS-3775 [5]. The test waits 5s for an executor to terminate.
> >   * MasterMaintenanceTest.EnterMaintenanceMode (5087 ms), filed
> > MESOS-4168 [19].
> >   * MasterMaintenanceTest.InverseOffers (2027 ms), filed MESOS-4169 [20].
> >   * OversubscriptionTest.UpdateAllocatorOnSchedulerFailover (1018 ms),
> > filed MESOS-4170 [21].
> >   * OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover (1018 ms),
> > filed MESOS-4171 [22].
> >   * GarbageCollectorIntegrationTest.Restart (5102 ms), filed MESOS-4172
> > [23].
> >   * HealthCheckTest.CheckCommandTimeout (15483 ms), filed MESOS-4173
> > [24].
> >   * HookTest.VerifySlaveLaunchExecutorHook (5061 ms), filed MESOS-4174
> > [25].
> >   * ContentType/SchedulerTest.Decline/0 (1022 ms), filed MESOS-4175 [26].
> >
> > Thanks for reading up to this point,
> > AlexR
> >
> >
> > [1] https://issues.apache.org/jira/browse/MESOS-733
> > [2] https://issues.apache.org/jira/browse/MESOS-1757
> > [3] https://issues.apache.org/jira/browse/MESOS-297
> > [4] https://issues.apache.org/jira/browse/MESOS-2059
> > [5] https://issues.apache.org/jira/browse/MESOS-3775
> > [6] https://issues.apache.org/jira/browse/MESOS-4155
> > [7] https://issues.apache.org/jira/browse/MESOS-4156
> > [8] https://issues.apache.org/jira/browse/MESOS-4157
> > [9] https://issues.apache.org/jira/browse/MESOS-4158
> > [10] https://issues.apache.org/jira/browse/MESOS-4159
> > [11] https://issues.apache.org/jira/browse/MESOS-4160
> > [12] https://issues.apache.org/jira/browse/MESOS-4161
> > [13] https://issues.apache.org/jira/browse/MESOS-4162
> > [14] https://issues.apache.org/jira/browse/MESOS-4163
> > [15] https://issues.apache.org/jira/browse/MESOS-4164
> > [16] https://issues.apache.org/jira/browse/MESOS-4165
> > [17] https://issues.apache.org/jira/browse/MESOS-4166
> > [18] https://issues.apache.org/jira/browse/MESOS-4167
> > [19] https://issues.apache.org/jira/browse/MESOS-4168
> > [20] https://issues.apache.org/jira/browse/MESOS-4169
> > [21] https://issues.apache.org/jira/browse/MESOS-4170
> > [22] https://issues.apache.org/jira/browse/MESOS-4171
> > [23] https://issues.apache.org/jira/browse/MESOS-4172
> > [24] https://issues.apache.org/jira/browse/MESOS-4173
> > [25] https://issues.apache.org/jira/browse/MESOS-4174
> > [26] https://issues.apache.org/jira/browse/MESOS-4175
> >
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>
