"Kevin L. Mitchell" <kevin.mitch...@rackspace.com> writes:

> One of the things that's really bugging me these days is transient
> failures, such as the inability to download a package, causing a gate
> job to fail. It seems to me that we can distinguish "test failure" from
> "environment build failure" easily enough, and automatically retry in
> the latter case. Is this possible in practice with our current CI
> infrastructure?

Yes, that's certainly been a big annoyance lately. That's a good suggestion, though there are a couple of things that make it not straightforward.

First, jenkins doesn't have a facility to easily express (through some means such as an exit code) that a job has had anything other than a simple success/failure outcome; I believe that's an open feature request with jenkins.

Second, even if we worked around that, for better or worse, since we started using virtualenvs instead of packages, a lot of what we're testing now includes things like dependencies, configuration, installation, and other items that are ancillary to the unit tests themselves. If a change adds "blorgh==1.0" to pip-requires, is the inability to install that a transient or permanent error?

These may be solvable problems, but they'll take some engineering effort, and I have some ideas of where we may get a better return for our work.

Most of the transient failures can be attributed to two causes: failures downloading packages, and failures connecting to gerrit.

Monty has been working on a pypi mirror setup so that we can be responsible for ensuring that all of the python packages that pip needs to install are available to the jenkins slaves. We had hoped that simply adding a mirror would be enough, but as long as pip knows about both pypi.openstack.org and pypi.python.org, it will end up crawling the web pages of projects listed in the pypi mirror looking for new versions. So to really get to the point where we can run jobs with no unnecessary network dependencies, we have to be sure that our pypi mirror has every package needed, including when new dependencies are added.

At the design summit, it was decided that we should move to a global list of dependencies for OpenStack. With that in place, it should be easy to maintain the package inventory for our pypi mirror: we can update the mirror when changes to the global dependency list are merged.

However, that work seems to be stalled:

  https://bugs.launchpad.net/openstack-ci/+bug/995607

A reason we've seen even more errors downloading packages in recent weeks is that there have been some flaws with our pypi mirror implementation. Monty has been working this weekend to rectify those, so hopefully we'll see a significant drop in these errors when that is finished.

And finally, we have increased the number of builds jenkins runs, both to run tests on new patchsets when they are uploaded and to run merge gates in parallel (which sometimes requires multiple runs of tests). That has increased the load on the gerrit server, which occasionally results in transient errors. Tuning gerrit is a bit of a black art; there's plenty of capacity on the server, but I believe further tuning is going to require a bit more instrumentation than we have now. Clark Boylan has been working on adding Java Melody to gerrit to help with that, so I hope we can get a handle on that soon. In the meantime, we have some ideas about how to work around that (retry with exponential backoff in the git scripts that jenkins uses, or cloning directly from a git repo instead of via gerrit).

So with all that background, I think we should discuss the following at the CI team meeting on Tuesday:

1) What's the status of the global dependency list? Can we update:

     https://bugs.launchpad.net/openstack-ci/+bug/995607

   Can we get it implemented in a reasonable amount of time to address these other issues (perhaps a couple of weeks)?

2) If not, can we make the pypi mirror be the only source of python packages for jenkins sooner? When we used pip bundles with tox, we set up the jobs to use the bundle unless there was a change to a -requires file. Could we do something similar and make pypi.openstack.org the only pip mirror unless there is a dependency change being tested?

3) Decide on a course of action to mitigate failures from transient gerrit errors (but continue to work on eliminating them in the first place).

4) Decide how to implement retriggering with Zuul.

It's my very strong belief that our build systems should be robust enough that we don't need to retrigger jobs because of transient failures. It is not a good use of the time of busy and skilled developers to babysit jenkins jobs and retry them if they fail. So I think our priority should always be eliminating the causes of those failures, which is why I listed items 1-3 above in that order.

However, there are always likely to be new causes for transient failures, and while we work on correcting them, we shouldn't make retrying builds any harder than it needs to be. We have a couple of suggestions as to how to implement that in Zuul. It should be easy to do; we just need to think through some user experience items.

So, in short, the recent badness with transient failures sucks, but I think we have some productive avenues we can take to get to a much better place soon.

-Jim
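
For reference, the "retry with exponential backoff" workaround for transient gerrit errors mentioned above could look roughly like the sketch below. This is only an illustration, not the scripts jenkins actually uses: the names backoff_delays, retry and fetch_change, and the default attempt count and delays, are all assumptions made for the example.

```python
import subprocess
import time


def backoff_delays(attempts, base_delay=1.0, factor=2.0, max_delay=60.0):
    """Yield the wait before each retry: base, base*factor, ..., capped at max_delay."""
    delay = base_delay
    # N attempts means at most N-1 waits between them.
    for _ in range(attempts - 1):
        yield min(delay, max_delay)
        delay *= factor


def retry(func, attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or `attempts` calls have failed."""
    delays = backoff_delays(attempts, base_delay)
    while True:
        try:
            return func()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error
            sleep(delay)


def fetch_change(remote, ref):
    """Hypothetical helper: fetch one ref; raises CalledProcessError on failure."""
    subprocess.check_call(["git", "fetch", remote, ref])
```

A job script might wrap the fetch as retry(lambda: fetch_change(remote, ref), retry_on=(subprocess.CalledProcessError,)). Note that this retries on any git failure, so a permanent error (for example a ref that does not exist) is only reported after the last attempt; keeping the attempt count small limits how long that takes.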