+1

2015-12-16 2:15 GMT+08:00 Alex Rukletsov <[email protected]>:
> Folks,
>
> I would like to share some facts and thoughts about tests. When I ran
> `make check -j7` on my Mac OS machine the other day, gtest reported the
> following (your numbers may vary depending on the OS you're on and the
> filters you use):
> [==========] 882 tests from 117 test cases ran. (298610 ms total)
>
> The same command for Mesos 0.21.1, which was released around a year ago,
> yields
> [==========] 452 tests from 71 test cases ran. (196398 ms total)
>
> We almost doubled the number of tests in 2015. I think this is a great
> achievement per se; moreover, it makes the life of cluster operators,
> release managers, and Mesos contributors less stressful. I am going to have
> an extra glass of champagne to celebrate this at the upcoming New Year's
> Eve : ).
>
> There are still some flaky tests left (and there always will be, failure
> is embedded into progress), but it is not the flakiness I would like to
> discuss today. I would like to draw your attention to the last number in
> the gtest output lines above, the total run time.
>
> When adding tests, we also contribute to the time it takes for a complete
> test suite to run. There are multiple ways we can keep this number
> small (one is, heh, to write fewer tests : ) ). Today I propose to focus on
> reducing the duration of individual test cases.
>
> Mesos tests are often built around certain sequences of events; some of
> those have timeouts, some depend on other events. Naive test
> implementations sometimes lead to a test being blocked for the duration of
> some timeout, pointlessly slowing down the whole suite! A good indicator of
> such a test is that its duration is an integral number of seconds (the
> timeout) plus some delta (the actual testing code), for example 3123 ms or
> 5076 ms.
>
> Suggestion: if you write a new test, please look at the test duration as
> well. If it seems unreasonably long, investigate what the reasons are and
> how you can make the test faster.
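The indicator described above (a duration that is a whole number of seconds plus a small delta) can be checked mechanically. A minimal Python sketch follows; the function name `looks_timeout_bound` and the 150 ms delta cutoff are assumptions made for illustration, not something Mesos or gtest provides.

```python
def looks_timeout_bound(duration_ms, max_delta_ms=150):
    """Return True if a duration looks like 'N whole seconds plus a small delta'.

    max_delta_ms is an assumed cutoff for the 'actual testing code' part;
    it is a guess and should be tuned for the machine running the tests.
    """
    if duration_ms < 1000:
        return False  # shorter than one whole second, no full timeout elapsed
    # The remainder after removing whole seconds is the candidate delta.
    return duration_ms % 1000 <= max_delta_ms
```

With these defaults, 3123 ms and 5076 ms are flagged while 15483 ms is not, so a hit is only a hint to look closer and a miss does not prove that a test is free of timeouts.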
>
> State of the art:
> * Slave recovery tests are known to be slow, see MESOS-733 [1].
> * Ben Mahler created an epic to track slow tests more than a year ago
>   (MESOS-1757 [2]) and did some work earlier (MESOS-297 [3]).
> * Dominic Hamon did pretty much what I have done (with a much nicer
>   command, too bad I noticed that after generating the list myself) and
>   filed MESOS-2059 [4].
>
> To get a list of suspect tests I ran
> `./bin/mesos-tests.sh 2>/dev/null | grep "ms)"`
> and noted down the tests that took more than 1 second to complete.
> To my knowledge, 1s is the shortest timeout we use in default values for
> configurable parameters.
>
> For each test from the list I either created a JIRA ticket, or grouped a
> bunch of seemingly related tickets into an epic (details below). I hijacked
> MESOS-1757 [2] and made it a parent for all newly created epics and
> tickets.
>
> I would like to encourage folks to look at these tickets and work on them
> when they have the time and are in the mood. Apart from making `make check`
> faster, I believe that most of these tickets are actually a very good way
> to familiarize yourself with the Mesos codebase (hence I marked all tickets
> as `newbie++`), so if you would like to contribute to Mesos but do not know
> where to start, this can be a good choice!
>
> It is clear that some tickets are false positives and there is a good
> reason why a particular test takes longer than others. In this case a
> comment explaining the reason is a proper resolution for the ticket.
>
> To avoid difficulties with finding a shepherd, I would suggest
> investigating the test first, understanding the reason for the slowness,
> and updating the ticket, so that a potential shepherd can more easily
> estimate the amount of time necessary for fixing the issue. Investigating
> does not require a shepherd, and once it is done, all following steps
> (finding a shepherd, submitting a patch, getting it committed) are trivial.
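The filtering step described above (keeping only tests that took more than 1 second) can also be scripted instead of read off by eye. A minimal Python sketch, assuming the usual gtest per-test result lines of the form `[       OK ] Suite.Name (1234 ms)`; the function name `slow_tests` and its parameters are hypothetical, and the sample in the usage note is not real output.

```python
import re

# Matches gtest per-test result lines such as
#   [       OK ] MasterTest.OfferTimeout (1053 ms)
# Suite summaries and the final "[==========] ... (N ms total)" line
# do not start with OK/FAILED and are therefore skipped.
RESULT_LINE = re.compile(r"^\[\s*(?:OK|FAILED)\s*\]\s+(\S+).*\((\d+) ms\)\s*$")


def slow_tests(lines, threshold_ms=1000):
    """Return (test name, duration in ms) pairs above threshold_ms, slowest first."""
    found = []
    for line in lines:
        match = RESULT_LINE.match(line.rstrip("\n"))
        if match and int(match.group(2)) > threshold_ms:
            found.append((match.group(1), int(match.group(2))))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

One possible use is to save the test output to a file and pass the open file object to `slow_tests`, since it accepts any iterable of lines. The default threshold of 1000 ms mirrors the 1 second cutoff used for the list in this mail.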
>
> I believe some tests may share the same root cause (for example, they rely
> on the same timeout, which cannot be changed from the test harness). In
> this case all such tests can be fixed by a single change.
>
> Below are the suspect tests.
> * Examples tests, slow since early days, see MESOS-297 [3]. Filed
>   MESOS-4155 [6].
> * Fetcher cache and fetcher cache HTTP tests, filed MESOS-4156 [7].
> * ZooKeeper tests, some have been slow since early days, see MESOS-297 [3].
>   Filed MESOS-4157 [8].
> * Slave recovery tests. Known to be slow, see MESOS-733 [1] and MESOS-297
>   [3]. Filed MESOS-4158 [9].
> * Group tests, filed MESOS-4159 [10].
> * Recover tests, filed MESOS-4160 [11].
>
> * SlaveTest.CommandExecutorWithOverride (1311 ms), filed MESOS-4161 [12].
> * SlaveTest.MetricsSlaveLaunchErrors (1009 ms), filed MESOS-4162 [13].
> * SlaveTest.HTTPSchedulerSlaveRestart (2307 ms), filed MESOS-4163 [14].
> * MasterTest.RecoverResources (1018 ms), filed MESOS-4164 [15].
> * MasterTest.MasterInfoOnReElection (1024 ms), filed MESOS-4165 [16].
> * MasterTest.LaunchCombinedOfferTest (2023 ms), filed MESOS-4166 [17].
> * MasterTest.OfferTimeout (1053 ms), filed MESOS-4167 [18].
> * MasterAllocatorTest/0.SlaveLost (5076 ms). Allocator-related test, see
>   MESOS-3775 [5]. The test waits 5s for an executor to terminate.
> * MasterMaintenanceTest.EnterMaintenanceMode (5087 ms), filed MESOS-4168
>   [19].
> * MasterMaintenanceTest.InverseOffers (2027 ms), filed MESOS-4169 [20].
> * OversubscriptionTest.UpdateAllocatorOnSchedulerFailover (1018 ms),
>   filed MESOS-4170 [21].
> * OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover (1018 ms),
>   filed MESOS-4171 [22].
> * GarbageCollectorIntegrationTest.Restart (5102 ms), filed MESOS-4172
>   [23].
> * HealthCheckTest.CheckCommandTimeout (15483 ms), filed MESOS-4173 [24].
> * HookTest.VerifySlaveLaunchExecutorHook (5061 ms), filed MESOS-4174
>   [25].
> * ContentType/SchedulerTest.Decline/0 (1022 ms), filed MESOS-4175 [26].
>
> Thanks for reading up to this point,
> AlexR
>
>
> [1] https://issues.apache.org/jira/browse/MESOS-733
> [2] https://issues.apache.org/jira/browse/MESOS-1757
> [3] https://issues.apache.org/jira/browse/MESOS-297
> [4] https://issues.apache.org/jira/browse/MESOS-2059
> [5] https://issues.apache.org/jira/browse/MESOS-3775
> [6] https://issues.apache.org/jira/browse/MESOS-4155
> [7] https://issues.apache.org/jira/browse/MESOS-4156
> [8] https://issues.apache.org/jira/browse/MESOS-4157
> [9] https://issues.apache.org/jira/browse/MESOS-4158
> [10] https://issues.apache.org/jira/browse/MESOS-4159
> [11] https://issues.apache.org/jira/browse/MESOS-4160
> [12] https://issues.apache.org/jira/browse/MESOS-4161
> [13] https://issues.apache.org/jira/browse/MESOS-4162
> [14] https://issues.apache.org/jira/browse/MESOS-4163
> [15] https://issues.apache.org/jira/browse/MESOS-4164
> [16] https://issues.apache.org/jira/browse/MESOS-4165
> [17] https://issues.apache.org/jira/browse/MESOS-4166
> [18] https://issues.apache.org/jira/browse/MESOS-4167
> [19] https://issues.apache.org/jira/browse/MESOS-4168
> [20] https://issues.apache.org/jira/browse/MESOS-4169
> [21] https://issues.apache.org/jira/browse/MESOS-4170
> [22] https://issues.apache.org/jira/browse/MESOS-4171
> [23] https://issues.apache.org/jira/browse/MESOS-4172
> [24] https://issues.apache.org/jira/browse/MESOS-4173
> [25] https://issues.apache.org/jira/browse/MESOS-4174
> [26] https://issues.apache.org/jira/browse/MESOS-4175
>

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com
