You know it's bad when you can't sleep because you're redesigning gate workflows in your head... so I apologise that this email is perhaps not as rational, nor as organised, as usual - but, ^^^^. :)
Obviously this is very important to address, and if we can come up with something systemic I'm going to devote my time both directly, and via resource-hunting within HP, to addressing it. And accordingly I'm going to feel free to say 'zuul this' with no regard for existing features. We need to get ahead of the problem and figure out how to stay there, and I think below I show why the current strategy just won't do that.

On 13 June 2014 06:08, Sean Dague <[email protected]> wrote:
> We're hitting a couple of inflection points.
>
> 1) We're basically at capacity for the unit work that we can do. Which
> means it's time to start making decisions if we believe everything we
> currently have running is more important than the things we aren't
> currently testing.
>
> Everyone wants multinode testing in the gate. It would be impossible to
> support that given current resources.

How much of our capacity problems are due to waste, such as:
- tempest runs of code the author knows is broken
- tempest runs of code that doesn't pass unit tests
- tempest runs while the baseline is unstable
  - to expand on this one: if master only passes one commit in 4, no check job can have a higher success rate overall

versus how much are an indication of the sheer volume of development being done?

> 2) We're far past the inflection point of people actually debugging jobs
> when they go wrong.
>
> The gate is backed up (currently to 24hrs) because there are bugs in
> OpenStack. Those are popping up at a rate much faster than the number of
> people who are willing to spend any time on them. And often they are
> popping up in configurations that we're not all that familiar with.

So, I *totally* appreciate that people fixing the jobs is the visible expendable resource, but I'm not sure it's the bottleneck. I think the bottleneck is our aggregate ability to a) detect the problem and b) resolve it.

For instance - strawman - when the gate goes bad, after a check for external issues like new SQLAlchemy releases etc, what if we just rolled trunk of every project in the integrated gate back to before the success rate nosedived? I'm well aware of the DVCS issues that implies, but from a human debugging perspective that would massively increase the leverage we get from the folk that do dive in and help. It moves from 'figure out that there is a problem and it came in after X AND FIX IT' to 'figure out it came in after X'. Reverting is usually much faster and more robust than rolling forward, because rolling forward has more unknowns.

I think we have a systematic problem, because this situation happens again and again. The root cause is that our time to detect races/nondeterministic tests is a probability function, not a simple scalar: sometimes we catch such tests within one patch in the gate, sometimes they slip through. If we want to land hundreds or thousands of patches a day, and we don't want this pain to keep happening, I don't see any way other than *either*:

A - not doing this whole gating CI process at all
B - making detection a whole lot more reliable (e.g. we want near-certainty that a given commit does not contain a race)
C - making repair a whole lot faster (e.g. we want <= one test cycle in the gate to recover once we have determined that some commit is broken)

Taking them in turn:

A - yeah, no. We have lots of experience with the axiom that that which is not tested is broken. And that's the big concern about removing things from our matrix: when they are not tested, we can be sure that they will break and we will have to spend neurons fixing them - either directly, or as reviews from the people fixing it.

B - this is really hard. Say we want to be quite sure that there are no new races that will occur with more than some probability in a given commit, and we assume that race codepaths might be run just once in the whole test matrix. A single test run can never tell us that - it just tells us it worked. What we need is some N trials in which we don't observe a new race (though we may observe old races), given a maximum acceptable risk of introducing a (say) 5% failure rate into the gate. [check my stats]

(1 - max risk)^trials = margin-of-error
0.95^N = 0.01
log(0.01, base=0.95) = N
N ~= 90

So if we want to stop 5% races landing, and any given race codepath may be exercised as little as once per run of the test matrix, we need to exercise the whole test matrix 90 times to be sure, to within that 1% margin, that we would have seen it. Raise that to a 1% race:

log(0.01, base=0.99) ~= 458

That's a lot of test runs. I don't think we can do that for each commit with our current resources - and I'm not at all sure that asking for enough resources to do that makes sense. Maybe it does.

Data point - our current exposure, with a 1% margin and a single trial:

(1 - max risk)^1 = 0.01
max risk = 99%

That is, a single passing gate run will happily let through races with any amount of fail, given enough trials. In fact, it's really just a numbers game for us at the moment - and we keep losing.
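To make that arithmetic easy to poke at, here's a tiny sketch (Python, purely illustrative - trials_needed is my name for it, not anything in our tooling) of the calculation above:

    import math

    def trials_needed(race_rate, margin=0.01):
        # Smallest N with (1 - race_rate)**N <= margin: the number of clean
        # full-matrix runs needed before a race that fails with probability
        # race_rate would have shown up, to within the given margin.
        return math.ceil(math.log(margin) / math.log(1.0 - race_rate))

    print(trials_needed(0.05))  # 90  - the 5% case above
    print(trials_needed(0.01))  # 459 - the 1% case (~458 before rounding up)

This treats each full test-matrix run as a single trial for the race, which matches the 'codepath exercised once per matrix' assumption above.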
B1. We could change our definition from a per-commit basis to instead saying 'within a given number of commits we want the probability of a new race to be low' - amortise the cost of gaining lots of confidence over more commits. It might work something like this (there's a rough sketch of the flow just after this description):
- run regular gate jobs against deeper and deeper zuul refs
- failures eject single commits as usual
- don't propagate successes
- keep going until we have 100 commits all validated but not propagated, *or* more than (some time window, let's say 5 hours) has passed
- start 500 test runs of all those commits, in parallel
- if that fails, eject the whole window
- otherwise let it in

This might let races in individual commits within the window through, if and only if they are also fixed within the same window; coarse failures like basic API incompatibility or failure to use deps right would be detected as they are today. There's obviously room for speculative execution on the whole window too, in fact: run 600 jobs, 100 for the zuul ref build-up and 500 for the confidence-interval builder.

The downside of this approach is that there is a big window (because it's amortising a big expense) which will all go in together, or not at all. And we'd have to prevent *all those commits* from being resubmitted until the cause of the failure was identified and actively fixed. We'd want that to be enforced, not run on the honour system, because any of those commits can bounce the whole set out. The flip side is that it would be massively more effective at keeping bad commits out.
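To make the shape of B1 concrete, here's a rough, purely illustrative sketch - gate_window, run_gate_job, run_parallel_trials and the specific constants are invented for this email, not anything zuul has today:

    import time

    WINDOW_COMMITS = 100       # close the window at 100 validated commits...
    WINDOW_SECONDS = 5 * 3600  # ...or after 5 hours, whichever comes first
    CONFIDENCE_RUNS = 500      # parallel runs to gain confidence in the window

    def gate_window(pending_commits, run_gate_job, run_parallel_trials):
        # run_gate_job(commits) -> bool: one ordinary gate run against the
        #   speculative state of `commits` stacked together.
        # run_parallel_trials(commits, n) -> bool: n full-matrix runs of the
        #   stacked commits; True only if every run passes.
        window = []
        started = time.time()

        for commit in pending_commits:
            # Ordinary gate run against the deepening zuul ref: failures
            # eject single commits as today, but successes are *not*
            # propagated yet.
            if not run_gate_job(window + [commit]):
                continue  # eject just this commit
            window.append(commit)
            if len(window) >= WINDOW_COMMITS or time.time() - started > WINDOW_SECONDS:
                break

        # Amortised confidence check over the whole window at once.
        if run_parallel_trials(window, CONFIDENCE_RUNS):
            return window  # the whole window merges together
        return []          # any failure ejects (and blocks) the whole window

The speculative-execution variant is just running the 500 confidence jobs concurrently with the ref build-up instead of waiting for the window to close.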
B2. ??? I had some crazy ideas about multiple branches with more and more confidence in them, but I think they all actually boil down to variations on a theme of B1, and if we move the centre of developer mass to $wherever, the gate for that is where the pain will be felt.

C - If we can't make it harder to get races in, perhaps we can make it easier to get races out. We have pretty solid emergent statistics from every gate job that is run as check. What if we set a policy that when a gate queue gets a race:
- put a zuul stop on all merges and checks on all involved branches (prevent further damage, free capacity for validation)
- figure out when it surfaced
- determine it's not an external event
- revert all involved branches back to the point where they looked good, as one large operation
- run that through jenkins N (e.g. 458) times in parallel
- on success, land it
- go through all the merges that have been reverted and either twiddle them to be back in review with a new patchset against the revert to restore their content, or alternatively generate new reviews if gerrit would make that too hard

Getting folk to help
====================

On the social side there is currently very little direct signalling that the gate is in trouble. I don't mean there is no communication - there's lots. What I mean is that Fred, a developer not on the lists or IRC for whatever reason, pushing code up, has no signal until they go 'why am I not getting check results', visit the status page and go 'whoa'.

Maybe we can do something about that. For instance, when the gate is in trouble, have zuul not schedule check jobs at all, and refuse to do rechecks / revalidates on affected branches, unless the patch in question is a partial-bug: or bug: for one of the gate bugs. Zuul can communicate the status on the patch, so the developer knows (a rough sketch of this filter follows below). This will:
- free up capacity for testing whatever fix is being done for the issue
- avoid waste, since we know there is a high probability of spurious failures
- provide a clear signal that the project expectation is that when the gate is broken, fixing it is the highest priority
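A minimal sketch of that filter - entirely hypothetical, not zuul's actual API; should_run_checks, gate_is_broken, gate_bugs and the bug number are all invented for illustration:

    import re

    BUG_TAG = re.compile(r'^(partial-bug|bug):\s*#?(\d+)', re.I | re.M)

    def should_run_checks(commit_message, gate_is_broken, gate_bugs):
        # Only spend check capacity on patches that reference one of the
        # currently-open gate bugs; everything else waits (and is told why).
        if not gate_is_broken:
            return True
        referenced = {m.group(2) for m in BUG_TAG.finditer(commit_message)}
        return bool(referenced & set(gate_bugs))

    # With the gate broken on bug 1331274 (number invented), an unrelated
    # patch gets no check run, while a patch tagged for the gate bug does:
    print(should_run_checks("Fix typo in docs", True, {"1331274"}))  # False
    print(should_run_checks("Fix the race\n\nPartial-Bug: #1331274",
                            True, {"1331274"}))                      # True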
> Landing a gating job comes with maintenance. Maintenance in looking into
> failures, and not just running recheck. So there is an overhead to
> testing this many different configurations.
>
> I think #2 is just as important to realize as #1. As such I think we
> need to get to the point where there are a relatively small number of
> configurations that Infra/QA support, and beyond that every job needs
> sponsors. And if the job success or # of uncategorized fails go past
> some thresholds, we demote them to non-voting, and if you are non-voting
> for > 1 month, you get demoted to experimental (or some specific
> timeline, details to be sorted).

--
Robert Collins <[email protected]>
Distinguished Technologist
HP Converged Cloud

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev