Michi,

There is something bad happening at two of the cloud-worker nodes, 03 and 09, and I've disabled them until we can take a closer look. These two nodes were used for all of the failures that I found when looking back through the prior test runs, except for a few that occurred before Oct 24.
I've poked around enough to see that user space processes are occasionally being held off for periods of 5-10 seconds, which matches the description of the problem Michi provided. A closer look will have to wait until tomorrow.

Francis

On Wed, Nov 5, 2014 at 2:18 PM, Thomi Richards <[email protected]> wrote:
> Hi everyone,
>
> This is a query for the CI team. I've CCed their mailing list in the
> reply, but for anything infrastructure-related, you should be talking to
> someone on #ci (canonical IRC server) or #ubuntu-ci-eng (freenode IRC
> server).
>
> CI team: the unity APIs team are seeing test failures since 24th October
> on i386 test runners. Could someone please advise them on what they can do
> to get this resolved? (see below)
>
> Cheers!
>
> On Wed, Nov 5, 2014 at 9:14 PM, Michi Henning <[email protected]>
> wrote:
>
>> > I am not aware of anything. You are talking about unit tests, right?
>> > Can you please link to one of such failures?
>>
>> Hi Leo,
>>
>> Basically, the story is that we have unit tests failing on Jenkins. The
>> problems started on 24 October (as best I can tell), and they *only* strike
>> for the i386 builds. Amd64 and Arm always succeed.
>>
>> The tests that fail include tests that have *never* (yes, literally
>> never) failed before, not on any of our desktops, not on the phone, not on
>> anything, including Jenkins. Suddenly, they are failing by the bucket load
>> (yes, I really mean bucket load). There is a pattern in the failures: every
>> single failure relates to tests that, basically, do something, wait for a
>> while, and then check that whatever is supposed to happen has actually
>> happened.
>>
>> The tests are very tolerant in terms of the timing thresholds, so it's
>> not as if we are waiting for something that normally takes 1 ms and then
>> fail if it hasn't happened after 5 ms. The failures we are talking about
>> are all in the 500 ms and greater range. For example, we have seen a test
>> failure where we exec a simple process that, once it is started, returns a
>> message. Normally (even on the phone), that takes about 120 ms. We wait for
>> 4 seconds for that test and fail if the message doesn't reach us within
>> that time.
>>
>> We also see failures in a test that runs two threads, one of which does
>> something periodically, and the other one waits for the worker thread to
>> complete certain tasks. This test has never failed anywhere, and has
>> succeeded unchanged for i386 for at least the last six months. Since 24
>> October, we are seeing it fail regularly. It is absolutely certain that the
>> problem is not with the test (in the sense that there might be a race
>> condition or some such). The test runs cleanly with valgrind, helgrind,
>> thread sanitizer, address sanitizer, etc., and we wait for half a second
>> for something to happen that takes a microsecond to do, and there are no
>> other threads busy in the test. Yet, the one runnable thread that does the
>> job does not run for half a second.
>>
>> There are dozens of tests that are (more or less randomly) affected.
>> Sometimes this blows up, sometimes that… The failure pattern we are seeing
>> is consistent with either a heavily (as in very heavily) loaded machine, or
>> some problem with thread scheduling, where threads that are runnable get
>> delayed on the order of a second or more.
>>
>> In summary, everything I'm seeing points to some issue on Jenkins i386,
>> because the failures don't happen anywhere else, and happen for tests that
>> (unchanged) have succeeded on Jenkins hundreds of times prior to 24 October.
>>
>> Is there a way to figure out what is going on in the Jenkins
>> infrastructure? For example, if the Jenkins build tells me that it is
>> happening on cloud-worker-10, is there a way for me to figure out what
>> physical machine that corresponds to, and what the load on that machine is
>> at the time? I strongly suspect that the problems are either due to the
>> build machine trying to do more than it can, or possibly due to I/O
>> virtualization. (That second guess may well be wrong, seeing that all our
>> comms run over the backplane via Unix domain sockets.)
>>
>> If you want to see some of the failures, a look through the recent build
>> history for unity-scopes-api-devel-ci and
>> unity-scopes-api-devel-autolanding shows plenty of failed test runs. The
>> failures will probably not mean much to you without knowing our code. But
>> the upshot is that, for every single one of them, the failure is caused by
>> something taking orders of magnitude (as in 100-1000 times) longer than
>> what is reasonable.
>>
>> Thanks,
>>
>> Michi.
>
> --
> Thomi Richards
> [email protected]
>
> --
> Mailing list: https://launchpad.net/~canonical-ci-engineering
> Post to : [email protected]
> Unsubscribe : https://launchpad.net/~canonical-ci-engineering
> More help : https://help.launchpad.net/ListHelp

--
Francis Ginther
Canonical - Ubuntu Engineering - Continuous Integration Team
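For illustration, a minimal sketch of the kind of user-space stall check described above might look like the following (C++; the file name, interval, and threshold are illustrative assumptions, not the actual tool that was used). It sleeps in short intervals and reports any wall-clock gap well beyond the requested sleep, which is what a process being held off the CPU for 5-10 seconds would produce.

// jitter_monitor.cpp - hypothetical sketch: report when this process is held
// off the CPU for much longer than the short sleep it asked for.
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    using namespace std::chrono;
    const auto interval = milliseconds(100);   // expected sleep per iteration
    const auto threshold = milliseconds(500);  // report extra delay beyond this

    auto last = steady_clock::now();
    for (;;)
    {
        std::this_thread::sleep_for(interval);
        auto now = steady_clock::now();
        auto gap = duration_cast<milliseconds>(now - last) - interval;
        if (gap > threshold)
        {
            std::cerr << "stall: runnable process delayed by an extra "
                      << gap.count() << " ms" << std::endl;
        }
        last = now;
    }
}

Run on a healthy machine this should print nothing; on a node showing the behaviour described above it would be expected to report multi-second gaps.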
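Similarly, a sketch of the shape of the failing tests Michi describes (do something that normally takes microseconds, wait with a generous timeout, then check that it happened) could look like this; the names and the 500 ms timeout are illustrative and are not taken from the actual unity-scopes-api test suite:

// wait_pattern.cpp - hypothetical sketch of the wait-and-check test pattern.
#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <thread>

int main()
{
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Worker: the "something" that normally completes almost immediately.
    std::thread worker([&]
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
        cv.notify_one();
    });

    // Test body: wait up to 500 ms, orders of magnitude more than the work needs.
    std::unique_lock<std::mutex> lock(m);
    bool ok = cv.wait_for(lock, std::chrono::milliseconds(500), [&] { return done; });
    lock.unlock();
    worker.join();

    if (!ok)
    {
        // On the affected runners this is the branch that fires: the runnable
        // worker thread simply isn't scheduled for half a second or more.
        std::cerr << "FAIL: worker did not complete within 500 ms" << std::endl;
        return EXIT_FAILURE;
    }
    std::cout << "PASS" << std::endl;
    return EXIT_SUCCESS;
}

A test of this shape is not racy and tolerates enormous timing slack, so on a correctly scheduled machine it cannot plausibly fail; a failure points at the machine rather than the test, which is the argument made in the thread above.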

