Re: [openstack-dev] Gate Math or why you keep typing 'recheck'

Robert Collins Wed, 13 Nov 2013 19:34:28 -0800

On 14 November 2013 16:15, Joe Gordon <joe.gord...@gmail.com> wrote:
> Hi All,
>
> TL;DR: Failure rate for gate jobs in graphite http://tinyurl.com/mqju53r
...
> In short, even tiny bugs in gate have a major impact on the stability of
> gate!  And as we grow the number of integrated projects and increase the
> number of tests this pattern will only get worse.


Thanks for the analysis!

I have two comments (yes, only two!)

Firstly, 5% isn't a tiny bug. It's a huge bug. We're doing thousands
of runs a day. A tiny bug is, IMO, one with a 0.01% occurrence rate or
less. Let's recalibrate our heads around failure rates:
a 0.01% failure in a 10K node cloud doing deploys once a day will
happen every day (on average :)).
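To make that arithmetic concrete, here is a minimal sketch in Python.
The function names are made up for illustration, and it assumes each
deploy fails independently at the same rate, which a real cloud may
not satisfy.

```python
def expected_failures(rate, trials):
    """Expected number of failures across `trials` independent attempts."""
    return rate * trials


def prob_at_least_one(rate, trials):
    """Probability that at least one of `trials` independent attempts fails."""
    return 1.0 - (1.0 - rate) ** trials


tiny_rate = 0.0001   # the 0.01% "tiny bug" rate from the text
nodes = 10_000       # 10K node cloud, one deploy per node per day

# About 1 failure per day on average.
print(expected_failures(tiny_rate, nodes))

# Roughly 0.63: on most days at least one deploy fails.
print(prob_at_least_one(tiny_rate, nodes))
```

The same two functions with rate=0.05 show why 5% is not tiny: at a
thousand runs a day that is about 50 expected failures.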

Secondly, Google, in their testing talks, say they've basically given
up on the idea that they can eliminate all such issues in automated
tests - in their opinion it's an engineering tradeoff... I think we can
do better :) - I'd like to see us start running 5 or 10 duplicate
scenarios to set a bound on how flaky a test can be and still enter
the system /at all/, and to look for and back out changes that
introduce more subtle flaky bugs.
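A rough sketch of what duplicate runs buy us, in Python. The function
name is hypothetical, and it assumes the duplicate runs fail
independently with a fixed per-run failure rate, which is an
idealisation of real gate jobs.

```python
def pass_probability(failure_rate, duplicates):
    """Chance that a change with the given per-run failure rate passes
    every one of `duplicates` independent runs."""
    return (1.0 - failure_rate) ** duplicates


# How often a flaky change would still get in, by number of duplicate runs.
for rate in (0.30, 0.05, 0.0001):
    for runs in (1, 5, 10):
        print(f"failure rate {rate:.2%}, {runs:2d} runs: "
              f"passes {pass_probability(rate, runs):.1%} of the time")
```

Under these assumptions a 30% flaky change gets through 10 runs under
3% of the time, but a 5% one still gets through about 60% of the time,
so duplicates alone catch the gross cases and the subtler ones still
need to be found and backed out afterwards.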

-Rob

-- 
Robert Collins <rbtcoll...@hp.com>
Distinguished Technologist
HP Converged Cloud

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
