James E. Blair wrote:
> [...]
> Most of these bugs are not failures of the test system; they are real
> bugs.  Many of them have even been in OpenStack for a long time, but are
> only becoming visible now due to improvements in our tests.  That's not
> much help to developers whose patches are being hit with negative test
> results from unrelated failures.  We need to find a way to address the
> non-deterministic bugs that are lurking in OpenStack without making it
> easier for new bugs to creep in.

I think that's a critical point. As a community, we need to move away
from a perspective where we see the gate as a process step and describe
failures there as "the gate is broken".

Although in some cases the failures are indeed coming from a gate bug,
in most cases the failures are coming from a pileup of race conditions
and other rare errors in OpenStack itself. In other words, the gate is
not broken, *OpenStack* is broken. If you can't get the tests to pass on
a proposed change because of unrelated failures, that means OpenStack
itself has reached a level where it just doesn't work. The gate is just
a thermometer.

Those types of problems need to be solved, even if changes can be
introduced in the CI/gate system to mitigate some of their most painful
side-effects. However, currently, only a handful of developers actually
work on fixing such issues -- and today those developers are completely
overwhelmed and burnt out.

We need to have more people working on those bugs. We need to
communicate this key type of strategic contribution to our corporate
sponsors. We need to make it practical to work on those bugs, by
providing all the tools we can to help in the debugging. We need to make
it rewarding to work on those bugs: some of those bugs will be the most
complex bugs you can find in OpenStack -- they should be viewed as an
intellectual challenge for our best minds, rather than as cleaning up a
sewer that other people continuously help to fill.

> The CI system and project infrastructure are not static.  They have
> evolved with the project to get to where they are today, and the
> challenge now is to continue to evolve them to address the problems
> we're seeing now.  The QA and Infrastructure teams recently hosted a
> sprint where we discussed some of these issues in depth.  This post from
> Sean Dague goes into a bit of the background: [1].  The rest of this
> email outlines the medium and long-term changes we would like to make to
> address these problems.
> [...]

I like all the options suggested there, and I enjoyed the discussion
that followed.

Thierry Carrez (ttx)

OpenStack-dev mailing list
