On 14 November 2013 16:15, Joe Gordon <joe.gord...@gmail.com> wrote:
> Hi All,
>
> TL;DR: Failure rate for gate jobs in graphite http://tinyurl.com/mqju53r
...
> In short, even tiny bugs in gate have a major impact on the stability of
> gate! And as we grow the number of integrated projects and increase the
> number of tests this pattern will only get worse.

Thanks for the analysis! I have two comments (yes, only two!)

Firstly, 5% isn't a tiny bug. It's a huge bug. We're doing thousands of
runs a day. A tiny bug is, IMO, one with a 0.01% occurrence rate or less.
Let's recalibrate our heads around failure rates: a 0.01% failure rate in
a 10K node cloud doing deploys once a day means a failure every day (on
average :)).

Secondly, Google, in their testing talks, say they've basically given up
on the idea that they can eliminate all such issues in automated tests -
in their opinion it's an engineering tradeoff... I think we can do
better :) - I'd like to see us start running 5 or 10 duplicate scenarios
to set a lower bound on flaky tests that can enter the system /at all/,
and to look for and back out changes that introduce more subtle flaky
bugs.

-Rob

-- 
Robert Collins <rbtcoll...@hp.com>
Distinguished Technologist
HP Converged Cloud

_______________________________________________
OpenStack-dev mailing list
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
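
As a rough check on the numbers in the message above, here is a minimal Python sketch. It assumes every run or deploy fails independently with a fixed probability, which is a simplification, and the function names are made up for illustration. It works through the 0.01% rate over 10K deploys a day, and estimates how often 5 or 10 duplicate runs would see at least one failure from a 5% flaky test.

```python
def expected_failures(rate, runs):
    """Expected number of failures over `runs` independent runs."""
    return rate * runs


def p_at_least_one_failure(rate, runs):
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - (1.0 - rate) ** runs


if __name__ == "__main__":
    # 0.01% failure rate, 10K nodes, one deploy per node per day.
    print("expected failures per day:", expected_failures(0.0001, 10_000))
    print("P(at least one failure per day):",
          round(p_at_least_one_failure(0.0001, 10_000), 3))

    # A 5% flaky test, run as 5 or 10 duplicate scenarios.
    for n in (5, 10):
        print(f"P(5% flake shows up in {n} runs):",
              round(p_at_least_one_failure(0.05, n), 3))
```

Under the independence assumption, the detection probability rises with the number of duplicate runs, and rarer flakes need correspondingly more runs to show up.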