On 05/09/2013 11:31 PM, Ademar de Souza Reis Jr. wrote:
Hi.

I've been discussing this with Cleber and Lucas (in private, just because I'm their manager at Red Hat) and decided to open this to the general audience of autotest, in the hope that we'll get more ideas and refine the brainstorm.

We have an internal testgrid that runs some of the virt-tests, but for each test that passes, we've been getting at least 2 notifications of failures. The vast majority of them are due to infrastructure problems (network, some repository is offline, disk is full, NFS failure, Cobbler failure, job aborted, etc).

This is a historical problem in autotest and we want a clean solution that fixes it for good, without kludges or ugly hacks.

So I propose we think outside of the box: what would be the ideal solution to this problem, without the limitations imposed by the current autotest architecture or backwards compatibility? Once we have this ideal solution as a goal, we start thinking of what needs to be sacrificed because of the autotest architecture, not the other way around. Naturally, the solution can be implemented in phases.

Here is my proposal, at the requirements level:

Definitions:

- Testgrid: the infrastructure used to run tests. It's composed of test runner(s), scheduler(s), RPC server(s), database(s), infrastructure for provisioning, etc.

- Autotest user: submits jobs to the testgrid and/or monitors the output of the jobs run.

- Testgrid admin: responsible for the maintenance of the testgrid, fixing the infrastructure and the services that it depends on.

Requirements (as user stories):

- As an autotest user, I want to be able to declare requirements for my test to be run. For example, I may need a specific package installed, or a specific service to be online. Besides, the test runner should automatically find out some requirements based on the test code I write. For example: if I use a method exposed by autotest that has a dependency on a particular service or package, the test runner should automatically consider it a requirement of my test as well.
We have some code that checks kernel and package versions based on the cfg files in the qemu tests. The code is in qemu/control.kernel-version in the virt repo (our kernel cfg is under qemu/cfg/host-kernel.cfg). With this solution the user only has to declare the packages that need to be checked; if a package check fails (for example, the package is not installed), the test case is dropped. The service check could probably be done in a similar way.
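To make the declared-requirements idea a bit more concrete, here is a minimal Python sketch. It is not the code from qemu/control.kernel-version; the `TestSpec` class, the `required_packages` field and the `filter_runnable` helper are hypothetical names used only for illustration, and the set of installed packages is passed in by hand rather than queried from the host.

```python
from dataclasses import dataclass, field


@dataclass
class TestSpec:
    """A test plus the packages its author declared as required (hypothetical)."""
    name: str
    required_packages: set = field(default_factory=set)


def filter_runnable(tests, installed_packages):
    """Split tests into runnable names and (name, missing packages) pairs."""
    installed = set(installed_packages)
    runnable, dropped = [], []
    for test in tests:
        missing = test.required_packages - installed
        if missing:
            # Drop the case instead of letting it fail for a reason
            # unrelated to the code under test.
            dropped.append((test.name, sorted(missing)))
        else:
            runnable.append(test.name)
    return runnable, dropped


tests = [
    TestSpec("boot", {"qemu-kvm"}),
    TestSpec("migrate", {"qemu-kvm", "nfs-utils"}),
]
runnable, dropped = filter_runnable(tests, installed_packages={"qemu-kvm"})
print(runnable)  # ['boot']
print(dropped)   # [('migrate', ['nfs-utils'])]
```

A service check could presumably reuse the same shape, with a set of reachable services in place of the installed packages.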
- As an autotest user, I want two primary kinds of notifications sent to me over e-mail: either the test ran and passed, or the test ran and failed (note: the test did run). Receiving a notification of a test failure should be like an alarm: it means there's something broken with *my code* and it needs immediate attention. False positives should be a very rare exception. Test jobs that failed due to broken infrastructure or broken services should be kept in a queue for a (long) period of time until the infrastructure gets fixed. After that period, they should be aborted, potentially sending me a notification e-mail.

- As an autotest user, I want the e-mail that notifies me of the job status to be consistent and clear about what went wrong. It should include links to more detailed information, log snippets, version of the components run, failure rate, etc. I don't want e-mails with missing fields or inconsistent reports.

- As an autotest user, I should be able to query the testgrid queue, my job status and the testgrid status via some sort of webservice API. A dashboard and rpc-client using this API would be great.
We may have some code that generates a test job coverage dashboard, which seems similar to this item, but it still needs some work. CC Feng Yang.
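The notification story above (only "ran and passed" or "ran and failed" reach the user, while infrastructure failures wait in the queue until a deadline) could be sketched roughly like this. The `Outcome` enum, the `next_action` function and the action strings are assumptions made for illustration; nothing like this is claimed to exist in autotest today.

```python
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"        # the test ran and passed
    FAILED = "failed"        # the test ran and failed
    INFRA_ERROR = "infra"    # the test could not run because of the testgrid


def next_action(outcome, queued_seconds, max_hold_seconds):
    """Decide what the scheduler does with a finished or blocked job."""
    if outcome in (Outcome.PASSED, Outcome.FAILED):
        # The test really ran, so the user gets an e-mail either way.
        return "notify"
    if queued_seconds < max_hold_seconds:
        # Broken infrastructure: keep the job on hold, send nothing yet.
        return "requeue"
    # The hold period expired without the infrastructure being fixed.
    return "abort"


print(next_action(Outcome.FAILED, 0, 3600))          # notify
print(next_action(Outcome.INFRA_ERROR, 10, 3600))    # requeue
print(next_action(Outcome.INFRA_ERROR, 3600, 3600))  # abort
```

Whether an aborted job also triggers an e-mail is left open here, as it is in the proposal.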
- As a testgrid admin, I want to be notified if a service is broken or offline. I want to have scripts or tests that monitor these services and pause the testgrid if something is wrong, putting the test jobs on hold.

- As a testgrid admin, I want to tell for how long the testgrid was offline due to broken infrastructure and list which services went down and when, to have a general idea of what's broken and needs to be fixed in the long term.

- As a testgrid admin, I want to be able to select which tests should run on which platform/hardware/OS. For example, I want to blacklist some tests (or variants) from a particular machine in the virtlab, or from a particular version of the OS.

Comments?

Thanks.

- Ademar
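As a rough illustration of the first two admin stories (monitor services, pause the grid, and keep a record of what went down and when), a sketch in Python might look like the following. The `GridMonitor` class, its `poll` method and the stand-in checks are all hypothetical; real checks would probe NFS, Cobbler, the repositories and so on, and timestamps are passed in explicitly so the example stays deterministic.

```python
class GridMonitor:
    """Polls service checks, pauses the grid on failure, logs outages."""

    def __init__(self, checks):
        self.checks = checks   # service name -> callable returning True if healthy
        self.paused = False
        self.outages = []      # (service, went_down_at, came_back_at or None)

    def poll(self, now):
        down = {name for name, check in self.checks.items() if not check()}
        # Close the outage record of every service that has recovered.
        for i, (name, start, end) in enumerate(self.outages):
            if end is None and name not in down:
                self.outages[i] = (name, start, now)
        # Open a record for every service that has just gone down.
        still_open = {name for name, _, end in self.outages if end is None}
        for name in sorted(down - still_open):
            self.outages.append((name, now, None))
        # Jobs are put on hold while anything is broken.
        self.paused = bool(down)
        return sorted(down)


# Stand-in health state instead of real probes (assumption for the example).
status = {"nfs": True, "cobbler": True}
monitor = GridMonitor({name: (lambda n=name: status[n]) for name in status})

monitor.poll(now=0)           # everything healthy, grid keeps running
status["nfs"] = False
broken = monitor.poll(now=10)   # NFS goes down, grid is paused
paused_during_outage = monitor.paused
status["nfs"] = True
monitor.poll(now=70)          # NFS is back, grid resumes
print(broken, paused_during_outage, monitor.paused)
print(monitor.outages)        # [('nfs', 10, 70)]
```

The outage list is what would answer "for how long was the testgrid offline, and because of which service".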
_______________________________________________
Autotest-kernel mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/autotest-kernel
