On 16 Jun 2014 20:33, "Thierry Carrez" <thie...@openstack.org> wrote:
>
> Robert Collins wrote:
> > [...]
> > C - If we can't make it harder to get races in, perhaps we can make it
> > easier to get races out. We have pretty solid emergent statistics from
> > every gate job that is run as check. What if we set a policy that when a
> > gate queue gets a race:
> > - put a zuul stop all merges and checks on all involved branches
> >   (prevent further damage, free capacity for validation)
> > - figure out when it surfaced
> > - determine it's not an external event
> > - revert all involved branches back to the point where they looked
> >   good, as one large operation
> > - run that through jenkins N (e.g. 458) times in parallel.
> > - on success land it
> > - go through all the merges that have been reverted and either
> >   twiddle them to be back in review with a new patchset against the
> >   revert to restore their content, or alternatively generate new reviews
> >   if gerrit would make that too hard.
>
> One of the issues here is that "gate queue gets a race" is not a binary
> state. There are always rare issues, you just can't find all the bugs
> that happen 0.00001% of the time. You add more such issues, and at some
> point they either add up to an unacceptable level, or some other
> environmental situation suddenly increases the odds of some old rare
> issue to happen (think: new test cluster with slightly different
> performance characteristics being thrown into our test resources). There
> is no single incident you need to find and fix, and during which you can
> clearly escalate to defCon 1. You can't even assume that a "gate
> situation" was created in the set of commits around when it surfaced.
>
> So IMHO it's a continuous process: keep looking into rare issues all
> the time, to maintain them under the level where they become a problem.
> You can't just have a specific process that kicks in when "the gate
> queue gets a race".

You seem to be drawing different conclusions here, but the emergent
behaviour is a shared model that we both have.

In no part of my mail did I suggest ignoring issues until we hit Defcon
one. I suggested that what we are doing is not working, and put forward
a model to explain why it's not working ... one which to me seems to fit
the evidence. And finally I suggested a few different things which might
help.

As for the specific scenario you raise that might not fit: adding a test
cluster is a change to our test config, and certainly something we could
revert. That's the benefit of configuration as code.
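
To put rough numbers on the two points in this thread (rare failures
adding up, and running a candidate revert N times in parallel), here is
a minimal sketch in Python. It assumes each race fires independently
with a fixed per-run probability, which real gate jobs will only
approximate. The function names and the example rates are purely
illustrative and are not taken from our gate statistics.

```python
import math


def combined_failure_rate(rates):
    """Chance that at least one of several independent races fires in one run.

    `rates` is an iterable of per-run failure probabilities (hypothetical
    values; independence between races is an assumption).
    """
    p_all_pass = 1.0
    for p in rates:
        p_all_pass *= (1.0 - p)
    return 1.0 - p_all_pass


def runs_needed(failure_rate, confidence=0.95):
    """Smallest N such that a race with the given per-run failure rate
    would be seen at least once in N runs with the given confidence.

    Solves 1 - (1 - failure_rate) ** N >= confidence for N.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Ten unrelated races at 1% each already fail roughly 1 run in 10.
    print(combined_failure_rate([0.01] * 10))
    # Clean runs needed to be 95% sure a 1%-per-run race is really gone.
    print(runs_needed(0.01, confidence=0.95))
```

Under those assumptions, a race that hits 1% of runs needs 299 clean
runs before we can be 95% confident the reverted tree no longer has it,
so N has to be chosen from the failure rate we are trying to rule out.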

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev