Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

Sean Dague Thu, 24 Jul 2014 15:26:24 -0700

On 07/24/2014 06:15 PM, Angus Salkeld wrote:
> On Wed, 2014-07-23 at 14:39 -0700, James E. Blair wrote:
>> OpenStack has a substantial CI system that is core to its development
>> process.  The goals of the system are to facilitate merging good code,
>> prevent regressions, and ensure that there is at least one configuration
>> of upstream OpenStack that we know works as a whole.  The "project
>> gating" technique that we use is effective at preventing many kinds of
>> regressions from landing, however more subtle, non-deterministic bugs
>> can still get through, and these are the bugs that are currently
>> plaguing developers with seemingly random test failures.
>>
>> Most of these bugs are not failures of the test system; they are real
>> bugs.  Many of them have even been in OpenStack for a long time, but are
>> only becoming visible now due to improvements in our tests.  That's not
>> much help to developers whose patches are being hit with negative test
>> results from unrelated failures.  We need to find a way to address the
>> non-deterministic bugs that are lurking in OpenStack without making it
>> easier for new bugs to creep in.
>>
>> The CI system and project infrastructure are not static.  They have
>> evolved with the project to get to where they are today, and the
>> challenge now is to continue to evolve them to address the problems
>> we're seeing now.  The QA and Infrastructure teams recently hosted a
>> sprint where we discussed some of these issues in depth.  This post from
>> Sean Dague goes into a bit of the background: [1].  The rest of this
>> email outlines the medium and long-term changes we would like to make to
>> address these problems.
>>
>> [1] https://dague.net/2014/07/22/openstack-failures/
>>
>> ==Things we're already doing==
>>
>> The elastic-recheck tool[2] is used to identify "random" failures in
>> test runs.  It tries to match failures to known bugs using signatures
>> created from log messages.  It helps developers prioritize bugs by how
>> frequently they manifest as test failures.  It also collects information
>> on unclassified errors -- we can see how many (and which) test runs
>> failed for an unknown reason and our overall progress on finding
>> fingerprints for random failures.
>>
>> [2] http://status.openstack.org/elastic-recheck/
>>
>> We added a feature to Zuul that lets us manually "promote" changes to
>> the top of the Gate pipeline.  When the QA team identifies a change that
>> fixes a bug that is affecting overall gate stability, we can move that
>> change to the top of the queue so that it may merge more quickly.
>>
>> We added the clean check facility in reaction to the January gate break
>> down. While it does mean that any individual patch might see more tests
>> run on it, it's now largely kept the gate queue at a countable number of
>> hours, instead of regularly growing to more than a work day in
>> length. It also means that a developer can Approve a code merge before
>> tests have returned, and not ruin it for everyone else if there turned
>> out to be a bug that the tests could catch.
>>
>> ==Future changes==
>>
>> ===Communication===
>> We used to be better at communicating about the CI system.  As it and
>> the project grew, we incrementally added to our institutional knowledge,
>> but we haven't been good about maintaining that information in a form
>> that new or existing contributors can consume to understand what's going
>> on and why.
>>
>> We have started on a major effort in that direction that we call the
>> "infra-manual" project -- it's designed to be a comprehensive "user
>> manual" for the project infrastructure, including the CI process.  Even
>> before that project is complete, we will write a document that
>> summarizes the CI system and ensure it is included in new developer
>> documentation and linked to from test results.
>>
>> There are also a number of ways for people to get involved in the CI
>> system, whether focused on Infrastructure or QA, but it is not always
>> clear how to do so.  We will improve our documentation to highlight how
>> to contribute.
>>
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use the data from elastic recheck to identify that
>> a patch is fixing a critical gate related bug. When one of these is
>> found in the queues it will be given higher priority, including bubbling
>> up to the top of the gate queue automatically. The manual promote
>> process should no longer be needed, and instead bugs fixing elastic
>> recheck tracked issues will be promoted automatically.
>>
>> At the same time we'll also promote review on critical gate bugs through
>> making them visible in a number of different channels (like on elastic
>> recheck pages, review day, and in the gerrit dashboards). The idea here
>> again is to make the reviews that fix key bugs pop to the top of
>> everyone's views.
>>
>> ===Testing more tactically===
>>
>> One of the challenges that exists today is that we've got basically 2
>> levels of testing in most of OpenStack: unit tests, and running a whole
>> OpenStack cloud. Over time we've focused on adding more and more
>> configurations and tests to the latter, but as we've seen, when things
>> fail in a whole OpenStack cloud, getting to the root cause is often
>> quite hard. So hard in fact that most people throw up their hands and
>> just run 'recheck'. If a test run fails, and no one looks at why, does
>> it provide any value?
>>
>> We need to get to a balance where we are testing that OpenStack works as
>> a whole in some configuration, but as we've seen, even our best and
>> brightest can't seem to make OpenStack reliably boot a compute that has
>> working networking 100% the time if we happen to be running more than 1
>> API request at once.
>>
>> Getting there is a multi party process:
>>
>>   * Reduce the gating configurations down to some gold standard
>>     configuration(s). This will be a small number of configurations that
>>     we all agree that everything will gate on. This means things like
>>     postgresql, cells, different environments will all get dropped from
>>     the gate as we know it.
>>
>>   * Put the burden for a bunch of these tests back on the projects as
>>     "functional" tests. Basically a custom devstack environment that a
>>     project can create with a set of services that they minimally need
>>     to do their job. These functional tests will live in the project
>>     tree, not in Tempest, so can be atomically landed as part of the
>>     project normal development process.
> 
> We do this in Solum and I really like it. It's nice for the same
> reviewers to see the functional tests and the code the implements a
> feature.
> 
> One downside is we have had failures due to tempest reworking their
> client code. This hasn't happened for a while, but it would be good
> for tempest to recognize that people are using tempest as a library
> and will maintain API.


To be clear, the functional tests will not be Tempest tests. This is a
different class of testing, it's really another tox target that needs a
devstack to run. A really good initial transition would be things like
the CLI testing.

Also, the Tempest team has gone out of it's way to tell people it's not
a stable interface, and don't do that. Contributions to help make parts
of Tempest into a stable library would be appreciated.

        -Sean

-- 
Sean Dague
http://dague.net

signature.asc
Description: OpenPGP digital signature

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

Reply via email to