I thought about that incident a bit more and there are several points I'd like to discuss.
AIUI, otto requires an up-to-date graphic driver in the kernel. This implies running the latest release on the host so the lxc container can use it. The container itself uses a snapshot to run the tests, to guarantee isolation. So far so good. But the failure we saw in the incident highlights the weakness in the model: if for some reason the system running on the host is broken by an update, no more tests can be run.

So first, we need a check (or several) that the host provides the right base for the container. Then we need to decide whether we accept the risk of a broken *host* or whether we need a pre-requisite test suite before accepting such upgrades. A first line of defense could be to have some smoke runs after an upgrade to ensure the host system is still usable. An alternative would be a dual-boot setup, so we can experiment on one boot without breaking the production one. Another alternative would be to run precise (for consistency with the other physical hosts we manage), or raring, or whatever is stable enough, and have a kvm in which we run the latest release and build the lxc only there. I'm not sure we can give access to the graphic card that way though.

Which brings another point: do we really need to run all tests against all graphic drivers (currently intel and nvidia, ati being on hold)? Or can we use them all as a pool to spread the load and consider the tests valid if they pass on any of them (we'll get some validation on all of them in the end just by running newer jobs)? Francis started to reply to that on IRC, saying:

<fginther> vila, This should be revisited, but I defer to the unity team (bregma) on wether or not that is still needed.
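To give an idea of how small such a host check could start, here's a sketch in Python. It is purely illustrative: the driver list and the /proc/modules parsing are my assumptions, not what otto actually does.

```python
# Hypothetical pre-flight check for an otto host (illustrative sketch,
# not otto's actual implementation): before dispatching jobs to the lxc
# containers, verify that the kernel actually loaded a known graphic
# driver, so a broken host upgrade fails fast and explicitly.

KNOWN_DRIVERS = ('i915', 'nouveau', 'nvidia')  # intel and nvidia; ati on hold


def loaded_graphic_drivers(proc_modules_text):
    """Return which known graphic drivers appear in /proc/modules content.

    Each line of /proc/modules starts with the module name, so we only
    need the first whitespace-separated field.
    """
    loaded = {line.split()[0] for line in proc_modules_text.splitlines()
              if line.strip()}
    return [d for d in KNOWN_DRIVERS if d in loaded]


def host_is_usable(proc_modules_text):
    """The host is a valid base for the containers only if at least one
    known graphic driver is loaded."""
    return bool(loaded_graphic_drivers(proc_modules_text))
```

A failing check like this would point straight at the host instead of surfacing later as a confusing autopilot failure.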
In any case, I'd like these points to be discussed, because this incident can happen again: we're still fully exposed, and having to work back to these kinds of questions from a failing autopilot test takes time and is not (IMHO) the output expected from a ci system ;) Ideally, for this case, a single test would have failed, pointing to an issue in either the lxc or the host. A very coarse test, but coarse tests are usually easier to start with, and they can be refined when new failures appear.

I realize this is a bit of a brain dump, but we encountered a simpler (but similar) issue with sst/selenium/firefox (sst automates website testing by piloting a browser). For pay and sso, some sst tests were designed to validate new pay and sso server releases. sst is implemented on top of selenium, which itself guarantees compatibility with firefox. When firefox releases a new version, selenium often has to catch up and release a new version too. There were gaps between ff and selenium releases during which the sst tests were failing: wrong output. The solution we came up with was to run a different job with the upcoming firefox version, so we got early warnings that selenium would break. This was never fully implemented, but the above describes the principle well enough (ask for clarifications otherwise ;): critical (for the ci engine) updates should be staged.

The otto jobs do not stage these updates (and maybe that's not possible, since they also need to validate the graphic driver). But the point remains: the jobs should not be able to break the ci engine and deliver misleading results. And while https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-10-16 has not been fully diagnosed, it's another case where a job is directly involved in breaking a part of the ci engine.

What do you think?
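To make the staging principle concrete, here's a minimal sketch (function names and the version-comparison approach are made up for illustration) of the kind of check such an early-warning job could run:

```python
# Illustrative sketch of the early-warning idea (hypothetical names):
# a staging job runs against the *upcoming* firefox and goes red before
# the production job starts producing wrong output.


def parse_version(version):
    """Turn a dotted version like '24.0' into (24, 0) so that versions
    compare numerically rather than lexically."""
    return tuple(int(part) for part in version.split('.'))


def selenium_will_break(upcoming_firefox, selenium_max_supported):
    """True when the upcoming firefox is newer than what the installed
    selenium supports, i.e. the staging job should fail now as an early
    warning, before production hits the gap."""
    return (parse_version(upcoming_firefox)
            > parse_version(selenium_max_supported))
```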
Vincent

