Re: [Autotest] RFC: test failures and notifications revamp

Yiqiao Pu Fri, 10 May 2013 03:02:45 -0700

On 05/09/2013 11:31 PM, Ademar de Souza Reis Jr. wrote:

Hi.


I've been discussing this with Cleber and Lucas (in private just
because I'm their manager at Red Hat) and decided to open this to
the general audience of autotest, in the hope that we'll get more
ideas and refine the brainstorm:

We have an internal testgrid that runs some of the virt-tests,
but for each test that passes, we've been getting at least 2
notifications of failures. The vast majority of them are due
to infrastructure problems (network, some repository is offline,
disk is full, NFS failure, Cobbler failure, job aborted, etc).

This is a historical problem in autotest and we want a clean
solution to solve it for good, without kludges or ugly hacks.

So I propose we think outside of the box: what would be the ideal
solution to this problem, without the limitations imposed by the
current autotest architecture or backwards compatibility?

Once we have this ideal solution as a goal, we start thinking of
what needs to be sacrificed because of the autotest architecture,
not the other way around.

Naturally, the solution can be implemented in phases.

Here is my proposal, at the requirements level:

Definitions:
   - Testgrid: the infrastructure used to run tests. It's composed
     of test runner(s), scheduler(s), RPC server(s), database(s),
     infrastructure for provisioning, etc.
   - Autotest user: submits jobs to the testgrid and/or monitors the
     output of the jobs run;
   - Testgrid admin: responsible for the maintenance of the
     testgrid, fixing the infrastructure and the services that it
     depends on;

Requirements (as user stories)
   - As an autotest user, I want to be able to declare
     requirements for my test to be run. For example, I may need a
     specific package installed, or a specific service to be
     online. Besides, the test runner should automatically find
     out some requirements based on the test code I write. For
     example: if I use a method exposed by autotest that has a
     dependency on a particular service or package, the test
     runner should automatically consider it a requirement of my
     test as well.

We have some code for kernel version and package version checks, based
on the cfg files, in the qemu tests. The code is shown in
qemu/control.kernel-version in the virt repo (our kernel cfg is under
qemu/cfg/host-kernel.cfg). This solution only requires the user to
declare the packages that need to be checked; if a declared package is
not installed (that is, the package check fails), the case will be
dropped. The service check can maybe be done in a similar way.
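To make the "declare requirements, drop the case if the check fails" idea concrete, here is a minimal sketch in Python. It is not the code from qemu/control.kernel-version: the function names and the shape of the `tests` mapping are hypothetical, and it assumes a requirement can be approximated by "this command is found on PATH" rather than by a real package query.

```python
import shutil


def missing_requirements(required_commands, which=shutil.which):
    """Return the declared commands that cannot be found on PATH.

    Assumption: a requirement is a command name, not a package name.
    """
    return [cmd for cmd in required_commands if which(cmd) is None]


def select_runnable(tests, which=shutil.which):
    """Split tests into (runnable, dropped) using their declared requirements.

    `tests` is a hypothetical mapping: test name -> list of required commands.
    `dropped` maps each dropped test to the requirements it is missing.
    """
    runnable, dropped = [], {}
    for name, required in tests.items():
        missing = missing_requirements(required, which)
        if missing:
            dropped[name] = missing  # requirement check failed: drop the case
        else:
            runnable.append(name)
    return runnable, dropped
```

Passing `which` as a parameter keeps the check easy to replace, so a service check (for example, "is this port reachable") could probably plug into the same structure.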


   - As an autotest user, I want two primary kinds of
     notifications sent to me over e-mail: either the test ran and
     passed, or the test ran and failed (note: the test did run).
     Receiving a notification of a test failure should be like an
     alarm: it means there's something broken with *my code* and
     needs immediate attention. False positives should be a very
     rare exception. Test jobs that failed due to broken
     infrastructure or broken services should be kept in a queue
     for a (long) period of time until the infrastructure gets
     fixed. After that period, they should be aborted,
     potentially sending me a notification e-mail.

   - As an autotest user, I want the e-mail that notifies me of
     the job status to be consistent and clear about what went
     wrong. It should include links to more detailed information,
     log snippets, version of the components run, failure rate,
     etc. I don't want e-mails with missing fields or inconsistent
     reports.

   - As an autotest user, I should be able to query the testgrid
     queue, my job status and the testgrid status via some sort of
     webservice API. A dashboard and rpc-client using this
     API would be great.

We may have some code to generate a test job coverage dashboard, which
seems to match this item, but it still needs some work. cc Feng Yang

   - As a testgrid admin, I want to be notified if a service is
     broken or offline. I want to have scripts or tests that
     monitor these services and pause the testgrid if something is
     wrong, putting the test jobs on hold.

   - As a testgrid admin, I want to tell for how long the testgrid
     was offline due to broken infrastructure and list which
     services went down and when, to have a general idea of what's
     broken and needs to be fixed in the long term.

   - As a testgrid admin, I want to be able to select which tests
     should run on which platform/hardware/OS. For example, I want
     to blacklist some tests (or variants) from a particular
     machine in the virtlab, or from a particular version of the
     OS.
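
The notification rules in the user stories above (notify on a real pass or fail, hold jobs hit by infrastructure problems, abort them after a long period) can be sketched as a small decision function. This is only an illustration of the requirement, not autotest code: the `Outcome` and `Action` names and the `max_hold_seconds` parameter are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"            # the test ran and passed
    FAILED = "failed"            # the test ran and failed
    INFRA_ERROR = "infra_error"  # the test could not run (network, NFS, ...)


class Action(Enum):
    NOTIFY_PASS = "notify_pass"
    NOTIFY_FAIL = "notify_fail"
    HOLD = "hold"    # keep the job queued until the testgrid is fixed
    ABORT = "abort"  # give up, potentially notifying the user


def triage(outcome, seconds_on_hold, max_hold_seconds):
    """Decide what to do with a job given its outcome and time spent on hold."""
    if outcome is Outcome.PASSED:
        return Action.NOTIFY_PASS
    if outcome is Outcome.FAILED:
        return Action.NOTIFY_FAIL
    # Infrastructure problem: never report it as a test failure.
    if seconds_on_hold < max_hold_seconds:
        return Action.HOLD
    return Action.ABORT
```

For example, `triage(Outcome.INFRA_ERROR, 3600, 86400)` keeps the job on hold, while the same job would be aborted once it has been waiting for a full day.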

Comments?

Thanks.
    - Ademar


_______________________________________________
Autotest-kernel mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/autotest-kernel
