Hi Francis,

thanks heaps for looking at this! From memory, I've also seen problems with 
cloud-worker-10 when building for unity-scopes-api-devel-ci.

If the nodes stop periodically, that would be perfectly consistent with the 
kinds of failures we are seeing, so it seems likely that you are on the right 
track.

Basically, for our tests to succeed, we need a machine that (roughly) provides 
the same performance as a phone. If things are four or five times slower than a 
phone, that's not a problem. But it's essentially impossible for us to be 
resilient when things are 50 times slower or more. In some cases, that would
slow the tests down intolerably and, in other cases, we would no longer be 
testing with any relevance to the real-world execution environment.

Anyway, thanks again for looking into this! We have seen similar issues in the 
past, but never to this degree. Is there a way to install some sort of watchdog 
process that can alert you to this problem? From our end, when a test fails on 
Jenkins, it is *very* difficult to establish that something on Jenkins is at 
fault. We tend to blame ourselves first. If the actual cause is something on 
Jenkins, it means that we have spent many hours trying to find a fault in our 
code, only to find out that we were chasing ghosts.

So, some sort of benchmarking process that runs periodically and verifies 
that a test machine delivers the expected performance might help?
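
To make that concrete, here is a minimal sketch of the kind of check I have 
in mind. Everything in it is illustrative: the workload size and the 
thresholds are invented and would need calibrating against a known-good 
worker.

    // watchdog.cpp -- hypothetical sketch of a periodic performance
    // check; all numbers here are made up.
    #include <chrono>
    #include <iostream>
    #include <thread>

    using namespace std::chrono;

    int main()
    {
        // Fixed CPU-bound workload; should take well under a second
        // on a healthy node.
        auto start = steady_clock::now();
        volatile double x = 0;
        for (int i = 0; i < 50000000; ++i)
            x += i * 0.5;
        auto end = steady_clock::now();
        auto cpu_ms = duration_cast<milliseconds>(end - start).count();

        // Scheduling check: a 100 ms sleep should wake up roughly on
        // time; a large overshoot means runnable processes are being
        // held off.
        start = steady_clock::now();
        std::this_thread::sleep_for(milliseconds(100));
        end = steady_clock::now();
        auto sleep_ms = duration_cast<milliseconds>(end - start).count();

        std::cout << "cpu: " << cpu_ms << " ms, sleep: " << sleep_ms
                  << " ms\n";

        if (cpu_ms > 2000 || sleep_ms > 500)
        {
            std::cerr << "node is much slower than expected\n";
            return 1;  // non-zero exit so cron/Jenkins can alert
        }
        return 0;
    }

Something like that could run from cron every few minutes and take a node 
out of the pool (or page someone) as soon as it starts failing.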

Cheers,

Michi.


On 6 Nov 2014, at 14:46, Francis Ginther <[email protected]> wrote:

> Michi,
> 
> There is something bad happening on two of the cloud-worker nodes, 03 and 
> 09, and I've disabled them until we can take a closer look. These two nodes 
> were used for all of the failures that I found when looking back through 
> the prior test runs, except for a few that occurred before Oct 24.
> 
> I've poked around enough to see that user-space processes are occasionally 
> being held off for periods of 5-10 seconds, which matches Michi's 
> description of the problem. A closer look will have to wait until tomorrow.
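> 
> For anyone who wants to reproduce this, something along the following 
> lines should show the stalls (a rough sketch only; the 10 ms interval 
> and the 1 s threshold are arbitrary):
> 
>     // stall-detect.cpp -- sleeps briefly in a loop and logs any
>     // wakeup that arrives much later than requested, i.e. a period
>     // where a runnable user-space process was held off.
>     #include <chrono>
>     #include <iostream>
>     #include <thread>
> 
>     using namespace std::chrono;
> 
>     int main()
>     {
>         auto const interval = milliseconds(10);
>         while (true)
>         {
>             auto before = steady_clock::now();
>             std::this_thread::sleep_for(interval);
>             auto delay = duration_cast<milliseconds>(
>                 steady_clock::now() - before) - interval;
>             if (delay > seconds(1))  // well beyond normal jitter
>                 std::cout << "held off for " << delay.count()
>                           << " ms\n";
>         }
>     }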
> 
> Francis
> 
> On Wed, Nov 5, 2014 at 2:18 PM, Thomi Richards <[email protected]> 
> wrote:
> Hi everyone,
> 
> 
> This is a query for the CI team. I've CCed their mailing list in the reply, 
> but for anything infrastructure-related, you should be talking to someone on 
> #ci (canonical IRC server) or #ubuntu-ci-eng (freenode IRC server).
> 
> CI team: the unity APIs team have been seeing test failures since 24 
> October on i386 test runners. Could someone please advise them on what 
> they can do to get this resolved? (see below)
> 
> 
> Cheers!
> 
> On Wed, Nov 5, 2014 at 9:14 PM, Michi Henning <[email protected]> 
> wrote:
> >
> > I am not aware of anything. You are talking about unit tests, right?
> > Can you please link to one such failure?
> 
> Hi Leo,
> 
> basically, the story is that we have unit tests failing on Jenkins. The 
> problems started on 24 October (as best I can tell), and they *only* strike 
> for the i386 builds. The amd64 and ARM builds always succeed.
> 
> The tests that fail include tests that have *never* (yes, literally never) 
> failed before, not on any of our desktops, not on the phone, not on anything, 
> including Jenkins. Suddenly, they are failing by the bucketload (yes, I 
> really mean bucketload). There is a pattern in the failures: every single 
> failure relates to tests that, basically, do something, wait for a while, and 
> then check that whatever is supposed to happen has actually happened.
> 
> The tests are very tolerant in terms of the timing thresholds, so it's not 
> as if we are waiting for something that normally takes 1 ms and then 
> failing if it hasn't happened after 5 ms. The failures we are talking about 
> are all in the 500 ms and greater range. For example, we have seen a test 
> failure where we exec a simple process that, once it is started, returns a 
> message. Normally (even on the phone), that takes about 120 ms. We wait up 
> to 4 seconds in that test and fail if the message doesn't reach us within 
> that time.
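> 
> Heavily simplified, the shape of that test is something like the following 
> (a sketch, not our actual code: the real test talks over Unix domain 
> sockets rather than a raw pipe, but the timing logic is the same):
> 
>     // Sketch of the exec test: spawn a child that writes one message
>     // and fail if it doesn't arrive within 4 seconds, i.e. more than
>     // 30 times longer than it normally takes.
>     #include <poll.h>
>     #include <unistd.h>
> 
>     int main()
>     {
>         int fds[2];
>         if (pipe(fds) != 0)
>             return 1;
> 
>         if (fork() == 0)  // child: the "simple process"
>         {
>             write(fds[1], "ready\n", 6);
>             _exit(0);
>         }
> 
>         // Parent: the message normally arrives in ~120 ms; allow 4 s.
>         pollfd pfd = { fds[0], POLLIN, 0 };
>         int ready = poll(&pfd, 1, 4000);
>         return ready == 1 ? 0 : 1;  // fails only if the child starved
>     }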
> 
> We also see failures in a test that runs two threads, one of which does 
> something periodically, while the other waits for the worker thread to 
> complete certain tasks. This test has never failed anywhere, and has 
> succeeded unchanged on i386 for at least the last six months. Since 24 
> October, we are seeing it fail regularly. It is absolutely certain that the 
> problem is not with the test (in the sense that there might be a race 
> condition or some such). The test runs cleanly under valgrind, helgrind, 
> thread sanitizer, address sanitizer, etc. We wait for half a second for 
> something to happen that takes a microsecond to do, and there are no other 
> threads busy in the test. Yet, the one runnable thread that does the job 
> does not run for half a second.
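> 
> Again as a sketch rather than our actual code, the waiting side of that 
> test looks something like this:
> 
>     // Sketch of the two-thread test: the worker's task takes about a
>     // microsecond; the waiter allows half a second.
>     #include <cassert>
>     #include <chrono>
>     #include <condition_variable>
>     #include <mutex>
>     #include <thread>
> 
>     int main()
>     {
>         std::mutex m;
>         std::condition_variable cv;
>         bool done = false;
> 
>         std::thread worker([&]
>         {
>             std::lock_guard<std::mutex> lock(m);
>             done = true;  // the actual work: trivial
>             cv.notify_one();
>         });
> 
>         std::unique_lock<std::mutex> lock(m);
>         bool ok = cv.wait_for(lock, std::chrono::milliseconds(500),
>                               [&] { return done; });
>         assert(ok);  // on the affected nodes, even 500 ms expires
>         lock.unlock();
>         worker.join();
>         return 0;
>     }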
> 
> There are dozens of tests that are (more or less randomly) affected. 
> Sometimes this blows up, sometimes that… The failure pattern we are seeing is 
> consistent with either a heavily (as in very heavily) loaded machine, or some 
> problem with thread scheduling, where threads that are runnable get delayed 
> on the order of a second or more.
> 
> In summary, everything I'm seeing points to some issue on Jenkins i386, 
> because the failures don't happen anywhere else, and happen for tests that 
> (unchanged) have succeeded on Jenkins hundreds of times prior to 24 October.
> 
> Is there a way to figure out what is going on in the Jenkins infrastructure? 
> For example, if the Jenkins build tells me that it is happening on 
> cloud-worker-10, is there a way for me to figure out what physical machine 
> that corresponds to, and what the load on that machine is at the time? I 
> strongly suspect that the problems are either due to the build machine trying 
> to do more than it can, or possibly due to I/O virtualization? (That second 
> guess may well be wrong, seeing that all our comms run over the backplane via 
> Unix domain sockets.)
> 
> If you want to see some of the failures, a look through the recent build 
> history for unity-scopes-api-devel-ci and unity-scopes-api-devel-autolanding 
> shows plenty of failed test runs. The failures will probably not mean much to 
> you without knowing our code. But, the upshot is that, for every single one 
> of them, the failure is caused by something taking orders of magnitude (as 
> in 100-1000 times) longer than is reasonable.
> 
> Thanks,
> 
> Michi.
> 
> 
> 
> -- 
> Thomi Richards
> [email protected]
> 
> 
> -- 
> Francis Ginther
> Canonical - Ubuntu Engineering - Continuous Integration Team

-- 
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp
