Nice write up.

Idea by Drew Wilson:
5. Make test shell dump (partial) results when there is a timeout. (This may
   actually be an item under "2".)

On Wed, Sep 9, 2009 at 10:51 AM, Ojan Vafai <[email protected]> wrote:

> I'm including at the top concrete tasks people can take to help identify
> and reduce flakiness. Read below for more details.
>
>    1. Mark slow tests as SLOW and reduce the timeout on the bots to 2
>    seconds.
>    2. Look into the cause of the timeouts on HTTP tests, especially on
>    Mac/Windows.
>    3. Look at the actual results off the bots for the non-timeout flaky
>    failures and identify the cause of the flakiness (likely the test itself).
>    4. Make test_expectations.txt match what's actually happening on the
>    bots (see the flakiness dashboard for tests with incorrect expectations).
>
> All the data I use below is from:
> http://src.chromium.org/viewvc/chrome/trunk/src/webkit/tools/layout_tests/flakiness_dashboard.html
>
> On Tue, Sep 8, 2009 at 5:52 PM, David Levin <[email protected]> wrote:
>
>> I agree that the chromium buildbot seems to have more flakiness on layout
>> tests than webkit buildbots.
>
> While there is definitely more flakiness, I'm not sure how much more. I
> think the Chromium bots are primarily more flaky on the HTTP tests. What
> flakiness there is gets less noticed on the webkit buildbots since they
> don't close the tree.
>
>> Here are two things that may help us to understand this:
>> 1. It would be nice to save crash logs from OSX into the zip file. For
>>    example, this run
>>    http://build.chromium.org/buildbot/waterfall/builders/Webkit%20Mac10.5%20(dbg)(2)/builds/3323/steps/webkit_tests/logs/stdio
>>    had a crash and likely generated a crash log at
>>    ~/Library/Logs/CrashReporter/TestShell*.crash which would help point to a
>>    culprit.
>
> +1 This would be very useful. That said, it won't help much with decreasing
> flakiness. Very few of the flaky tests are flaky crashers. They're
> almost entirely flaky timeouts or failures, even in debug builders.
>
>> 2.
>>    If we suspect that tests may pass if given more time, then increase the
>>    timeout and see if more tests pass but exceed the old timeout (log
>>    something when this happens so we can validate that it is working).
>
> -1 The test dashboard prints out the amount of time a test takes to run
> if it takes >1 second. I don't think the tests that time out would pass if
> we just gave them more time. Specifically, there are tests that always time
> out and there are flaky timeout tests. The flaky timeout tests, when they do
> pass, consistently take less than 10 seconds to run, and most of them take
> less than 1 second.
>
> Increasing the test timeout also *considerably* increases how long it takes
> for the bots to cycle. In fact, I think we should be *decreasing* it to
> something like 2 seconds. This would actually shave minutes off of the
> current bot cycle times.
>
> We have ~100 tests that are slow, many of which time out at 20 seconds. We
> should mark all the slow but passing tests as SLOW in the test expectations
> file. This will give them more time to run than the other tests. Then we
> should bring the timeout down to something like 2 seconds. This will make
> the bots run a lot faster and distinguish the tests that time out from the
> ones that just take a long time to pass.
>
>> On Tue, Sep 8, 2009 at 5:41 PM, Dirk Pranke <[email protected]> wrote:
>>
>>> From what I've poked around at, many of the LayoutTest flaky failures
>>> are timeout-related.
>
> While more than half of the flaky tests on Windows and Mac are timeouts,
> many of them are crashes or failures. You can see this pretty clearly on the
> layout test dashboard. I'll note that on Linux, a very small percentage of
> the flakiness is timeouts. Almost all of these timeouts on Windows/Mac are
> HTTP tests. There are likely one or two causes for all the flakiness with
> the HTTP tests.
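
To make the SLOW proposal above a bit more concrete, here is a minimal Python sketch of how a harness might pick a per-test timeout and tell real timeouts apart from slow tests that still finish. This is only an illustration under stated assumptions, not the actual harness code: the names (TimedRun, timeout_for, classify) and the test paths are hypothetical, the 2-second default is the value suggested in the thread, and the 20-second SLOW allowance simply reuses the limit that slow tests are said to hit today.

```python
from dataclasses import dataclass

# Assumed values. The thread suggests ~2 s for normal tests; the SLOW
# allowance is a placeholder that reuses the 20 s mentioned in the thread.
DEFAULT_TIMEOUT_S = 2.0
SLOW_TIMEOUT_S = 20.0


@dataclass(frozen=True)
class TimedRun:
    """One execution of one test (hypothetical record type)."""
    name: str
    elapsed_s: float  # wall-clock seconds until the test finished or was killed
    finished: bool    # False if the harness had to kill the test


def timeout_for(name, slow_tests):
    """Tests listed as SLOW get the longer limit, everything else the short one."""
    return SLOW_TIMEOUT_S if name in slow_tests else DEFAULT_TIMEOUT_S


def classify(run, slow_tests):
    """Return 'timeout', 'slow', or 'ok' based on timing alone."""
    limit = timeout_for(run.name, slow_tests)
    if not run.finished or run.elapsed_s > limit:
        return "timeout"
    if run.elapsed_s > DEFAULT_TIMEOUT_S:
        # Finished, but only because it was given the SLOW allowance.
        return "slow"
    return "ok"


if __name__ == "__main__":
    slow = {"http/tests/example-slow.html"}  # hypothetical SLOW entry
    runs = [
        TimedRun("fast/example.html", 0.4, True),
        TimedRun("fast/example-hang.html", 2.0, False),
        TimedRun("http/tests/example-slow.html", 7.5, True),
    ]
    for run in runs:
        print(run.name, classify(run, slow))
```

The point of the split is that an unmarked test that hangs costs the bot about 2 seconds instead of 20, while a test that is known to be slow still gets room to pass and shows up as "slow" rather than as a timeout.
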
>
>>> There's something in the test harness and web
>>> server configurations that causes tests to be unpredictably slower. I
>>> don't think Apple has this problem, and I think that's because they
>>> use the built-in Apache instance in OS X,
>
> We switched away from Apache to lighttpd because of flakiness it was causing
> on cygwin (cygwin and Apache don't play well together). Maybe it makes sense
> to use lighttpd on Windows and Apache on Mac? I think we should identify the
> cause of the flakiness on Windows. Fixing that might fix the flakiness on
> Mac as well and we wouldn't need to support two HTTP servers.
>
>>> and also because they have a
>>> very different model for test execution (how we run tests in
>>> parallel).
>
> Running tests in parallel did seem to make things a bit more flaky, but not
> much. I haven't verified this, but I think it probably just magnified
> existing flakiness by putting higher load on the machine. Linux, the least
> flaky bot, is the only bot that has 4 cores instead of just 2, which means
> it runs more TestShell instances in parallel.
