Re: [openstack-dev] Thoughts on the patch test failure rate and moving forward

Sean Dague Thu, 24 Jul 2014 15:55:28 -0700

On 07/24/2014 05:57 PM, Matthew Treinish wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>> OpenStack has a substantial CI system that is core to its development
>> process.  The goals of the system are to facilitate merging good code,
>> prevent regressions, and ensure that there is at least one configuration
>> of upstream OpenStack that we know works as a whole.  The "project
>> gating" technique that we use is effective at preventing many kinds of
>> regressions from landing; however, more subtle, non-deterministic bugs
>> can still get through, and these are the bugs that are currently
>> plaguing developers with seemingly random test failures.
>>
>> Most of these bugs are not failures of the test system; they are real
>> bugs.  Many of them have even been in OpenStack for a long time, but are
>> only becoming visible now due to improvements in our tests.  That's not
>> much help to developers whose patches are being hit with negative test
>> results from unrelated failures.  We need to find a way to address the
>> non-deterministic bugs that are lurking in OpenStack without making it
>> easier for new bugs to creep in.
>>
>> The CI system and project infrastructure are not static.  They have
>> evolved with the project to get to where they are today, and the
>> challenge now is to continue to evolve them to address the problems
>> we're seeing now.  The QA and Infrastructure teams recently hosted a
>> sprint where we discussed some of these issues in depth.  This post from
>> Sean Dague goes into a bit of the background: [1].  The rest of this
>> email outlines the medium and long-term changes we would like to make to
>> address these problems.
>>
>> [1] https://dague.net/2014/07/22/openstack-failures/
>>
>> ==Things we're already doing==
>>
>> The elastic-recheck tool[2] is used to identify "random" failures in
>> test runs.  It tries to match failures to known bugs using signatures
>> created from log messages.  It helps developers prioritize bugs by how
>> frequently they manifest as test failures.  It also collects information
>> on unclassified errors -- we can see how many (and which) test runs
>> failed for an unknown reason and our overall progress on finding
>> fingerprints for random failures.
>>
>> [2] http://status.openstack.org/elastic-recheck/
>>
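As a rough illustration of the fingerprint idea described above, here is a minimal sketch. It assumes plain regular expressions over log lines rather than the queries the real elastic-recheck tool uses, and the bug ids and log messages are made up for the example.

```python
import re
from collections import Counter

# Hypothetical fingerprints: bug id -> pattern expected in the logs of a
# failed run. Both the ids and the messages are invented for illustration.
FINGERPRINTS = {
    "bug-alpha": re.compile(r"example failure alpha"),
    "bug-beta": re.compile(r"example failure beta"),
}


def classify(log_lines):
    """Return the sorted bug ids whose fingerprint matches any log line."""
    return sorted(
        bug
        for bug, pattern in FINGERPRINTS.items()
        if any(pattern.search(line) for line in log_lines)
    )


def tally(failed_runs):
    """Count hits per known bug and the number of unclassified failed runs."""
    counts = Counter()
    unclassified = 0
    for log_lines in failed_runs:
        bugs = classify(log_lines)
        if not bugs:
            unclassified += 1
        counts.update(bugs)
    return counts, unclassified
```

The per-bug counts give the "how frequently does this bug hit us" ranking, and the unclassified count is the number of failed runs that still need a fingerprint.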
>> We added a feature to Zuul that lets us manually "promote" changes to
>> the top of the Gate pipeline.  When the QA team identifies a change that
>> fixes a bug that is affecting overall gate stability, we can move that
>> change to the top of the queue so that it may merge more quickly.
>>
>> We added the clean check facility in reaction to the January gate
>> breakdown. While it does mean that any individual patch might see more tests
>> run on it, it's now largely kept the gate queue at a countable number of
>> hours, instead of regularly growing to more than a work day in
>> length. It also means that a developer can Approve a code merge before
>> tests have returned, and not ruin it for everyone else if there turned
>> out to be a bug that the tests could catch.
>>
>> ==Future changes==
>>
>> ===Communication===
>> We used to be better at communicating about the CI system.  As it and
>> the project grew, we incrementally added to our institutional knowledge,
>> but we haven't been good about maintaining that information in a form
>> that new or existing contributors can consume to understand what's going
>> on and why.
>>
>> We have started on a major effort in that direction that we call the
>> "infra-manual" project -- it's designed to be a comprehensive "user
>> manual" for the project infrastructure, including the CI process.  Even
>> before that project is complete, we will write a document that
>> summarizes the CI system and ensure it is included in new developer
>> documentation and linked to from test results.
>>
>> There are also a number of ways for people to get involved in the CI
>> system, whether focused on Infrastructure or QA, but it is not always
>> clear how to do so.  We will improve our documentation to highlight how
>> to contribute.
>>
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use the data from elastic recheck to identify that
>> a patch is fixing a critical gate related bug. When one of these is
>> found in the queues it will be given higher priority, including bubbling
>> up to the top of the gate queue automatically. The manual promote
>> process should no longer be needed, and instead changes fixing elastic
>> recheck tracked issues will be promoted automatically.
>>
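The automatic promotion described above could be sketched roughly as follows. This is not how Zuul implements its queues; the `Change` type, the `fixes_bugs` field, and the bug ids are hypothetical names used only to show the ordering rule.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical stand-in for a change waiting in the gate queue."""
    number: int
    fixes_bugs: frozenset = frozenset()


def order_queue(queue, tracked_bugs):
    """Move changes that fix a tracked gate bug ahead of everything else.

    sorted() is stable, so changes keep their original relative order
    within the promoted group and within the remaining group.
    """
    return sorted(queue, key=lambda c: 0 if c.fixes_bugs & tracked_bugs else 1)
```

A change only jumps the queue if it claims to fix a bug that is currently tracked, so an ordinary change referencing some unrelated bug stays where it was.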
>> At the same time we'll also promote review on critical gate bugs through
>> making them visible in a number of different channels (like on elastic
>> recheck pages, review day, and in the gerrit dashboards). The idea here
>> again is to make the reviews that fix key bugs pop to the top of
>> everyone's views.
>>
>> ===Testing more tactically===
>>
>> One of the challenges that exists today is that we've got basically 2
>> levels of testing in most of OpenStack: unit tests, and running a whole
>> OpenStack cloud. Over time we've focused on adding more and more
>> configurations and tests to the latter, but as we've seen, when things
>> fail in a whole OpenStack cloud, getting to the root cause is often
>> quite hard. So hard in fact that most people throw up their hands and
>> just run 'recheck'. If a test run fails, and no one looks at why, does
>> it provide any value?
>>
>> We need to get to a balance where we are testing that OpenStack works as
>> a whole in some configuration, but as we've seen, even our best and
>> brightest can't seem to make OpenStack reliably boot a compute that has
>> working networking 100% of the time if we happen to be running more than 1
>> API request at once.
>>
>> Getting there is a multi party process:
>>
>>   * Reduce the gating configurations down to some gold standard
>>     configuration(s). This will be a small number of configurations that
>>     we all agree that everything will gate on. This means things like
>>     postgresql, cells, different environments will all get dropped from
>>     the gate as we know it.
>>
>>   * Put the burden for a bunch of these tests back on the projects as
>>     "functional" tests. Basically a custom devstack environment that a
>>     project can create with a set of services that they minimally need
>>     to do their job. These functional tests will live in the project
>>     tree, not in Tempest, so can be atomically landed as part of the
>>     project normal development process.
>>
>>   * For all non gold standard configurations, we'll dedicate a part of
>>     our infrastructure to running them in a continuous background loop,
>>     as well as making these configs available as experimental jobs. The
>>     idea here is that we'll actually be able to provide more
>>     configurations that are operating in a more traditional CI (post
>>     merge) context. People that are interested in keeping these bits
>>     functional can monitor those jobs and help with fixes when needed.
>>     The experimental jobs mean that if developers are concerned about
>>     the effect of a particular change on one of these configs, it's easy
>>     to request a pre-merge test run.  In the near term we might imagine
>>     this would allow for things like ceph, mongodb, docker, and possibly
>>     very new libvirt to be validated in some way upstream.
>>
>>   * Provide some kind of easy to view dashboards of these jobs, as well
>>     as a policy that if some job is failing for > some period of time,
>>     it's removed from the system. We want to provide whatever feedback
>>     we can to engaged parties, but people do need to realize that
>>     engagement is key. The biggest part of putting tests into OpenStack
>>     isn't landing the tests, but dealing with their failures.
>>
>>   * Encourage projects to specifically land interface tests in other
>>     projects when they depend on certain behavior.
> 
> So I think we (or at least I do) need clarification around this item. My question
> is which interfaces are we depending on that need these specific types of
> tests? Projects shouldn't be depending on another project's unstable interfaces.
> If specific behavior is required for a cross-project interaction it should be
> part of defined stable API, hopefully the REST API, and then that behavior
> should be enforced for everyone not just the cross-project interaction.
> 
> If I'm interpreting this correctly then what is actually needed here is to
> actually ensure that there is test coverage somewhere for the APIs that should
> already be tested where there is a cross-project dependency. This is actually
> the same thing we see all the time because there is a lack of test coverage
> on certain APIs that are being used. (the nova default quotas example comes to
> mind) I just think calling this a special class of test is a bit misleading,
> since it shouldn't actually differ from any other API test. Or am I missing
> something?


Projects are consuming the behavior of other projects far beyond just
the formal REST APIs. Notifications are another great instance of that.

This is also more of a pragmatic, organic approach to figuring out the
interfaces we need to lock down. When one project breaks depending on
an interface in another project, that should trigger this kind of
contract growth, which hopefully formally turns into a document later
for a stable interface.
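
To make the "contract growth" idea concrete, here is a minimal sketch of such a test for a notification payload. Everything in it is hypothetical: `build_notification` stands in for whatever producer-side code emits the notification, and the event type and key names are invented, not taken from any real project.

```python
import unittest


def build_notification(resource_id, state):
    """Hypothetical producer-side code that emits a notification dict."""
    return {
        "event_type": "example.resource.update",
        "payload": {"resource_id": resource_id, "state": state},
    }


# Keys a (hypothetical) consuming project relies on. Removing or renaming
# one of them in the producer should fail this test in the producer's tree.
REQUIRED_PAYLOAD_KEYS = {"resource_id", "state"}


class NotificationContractTest(unittest.TestCase):
    def test_payload_keeps_consumed_keys(self):
        note = build_notification("abc", "active")
        self.assertEqual(note["event_type"], "example.resource.update")
        self.assertTrue(REQUIRED_PAYLOAD_KEYS <= set(note["payload"]))
```

The point is that the test lives next to the code that can break the interface, so the breakage shows up in that project's own test run rather than in a full-cloud gate job.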

>> Let's imagine an example of how this works in the real world.
>>
>>   * The heat-slow job is deleted.
>>
>>   * The heat team creates a specific functional job which tests some of
>>     their deeper function in Heat, all the tests live in Heat, and
>>     because of these the tests can include white/grey box testing of the
>>     DB and queues while things are progressing.
>>
>>   * Nova lands a change which neither Tempest nor our configs exercise,
>>     but breaks Heat.
>>
>>   * The Heat project can now decide if it's more important to keep the
>>     test in place (preventing them from landing code), or to skip it to
>>     get back to work.
>>
>>   * The Heat team then works on the right fix for Nova, or communicates
>>     with the Nova team on the issue at hand. The fix to Nova *also*
>>     should include tests which lock down that interface so that Nova
>>     won't break them again in the future (the ironic team did this with
>>     their test_ironic_contract patch). These tests could be unit tests,
>>     if they are testable that way, or functional tests in the Nova tree.
> 
> The one thing I want to point out here is that the ironic_contract test should be
> an exception; I don't feel that we want that to be the norm. It's not a good
> example for a few reasons, mostly around the fact that the ironic tree depends on
> the purposefully unstable nova driver api as a temporary measure until the ironic
> driver is merged into the nova tree. The contract api tests will go away once
> the driver is in the nova tree. It should not be necessary for something over
> the REST API, since the contract should be enforced through tempest. (even under
> this new model, I expect this to still be true)
> 
> There was that comment which someone (I can't remember who) brought up at the
> Portland summit that tempest is acting like double book accounting for the api
> contract, and that has been something we've seen as extremely valuable
> historically. Which is why I don't want to see this aspect of tempest's role in
> the gate altered.

I've been the holder of the double book accounting pov in the past.
However, after the last six months of fragility, I just don't see how
that's a sustainable point of view. The QA team remains a somewhat
constant size, and the number of interfaces and projects grows at a good
clip.

> Although, all I think we actually need is an api definition for testing in an
> external repo, just to prevent inadvertent changes. (whether that gets used in
> tempest or not) So another alternative I see here is something that I've started
> to outline in [4] to address the potential for code duplication and effort in
> the new functional test suites. If all the project specific functional tests are
> using clients from an external functional testing library repo then this concern
> goes away.

Actually, I don't think these would be using external clients. This is
in-tree testing.

This will definitely be an experiment to get the API testing closer to
the source. That being said, Swift really has done this fine for a long
time, and I think we need to revisit the premise that projects can't be
trusted.

> Now, if something like this example were to be exposed because of a coverage
> gap I think it's fair game to have a specific test in nova's functional test
> suite. But, I also think there should be an external audit of that API somewhere
> too. Ideally I think what I'd like to see is probably a write-once test
> graduation procedure for moving appropriate things into tempest (or somewhere
> else) from the project specific functional tests. Basically like what we
> discussed during Maru's summit session on Neutron functional testing in Atlanta.

Right, and I think basically we shouldn't graduate most of those tests.
They are neutron tests, in the neutron tree. A few key ones we decide
should be run outside that context.

> For the other, more social, goal of this step in fostering communication between
> the projects and not using QA and/or Infra as a middle man I fully support. I
> agree that we probably have too much proxying going on between projects using QA
> and/or infra instead of necessarily talking directly.

Our current model leans far too much on the idea that the only time we
ever try to test things for real is when we throw all 1 million lines of
source code into one pot and stir. It really shouldn't be surprising how
many bugs shake out there. And this is the wrong layer to debug from, so
I firmly believe we need to change this back to something we can
actually manage to shake the bugs out with. Because right now we're
finding them, but our infrastructure isn't optimized for fixing them,
and we need to change that.

        -Sean

-- 
Sean Dague
http://dague.net

