On Fri, Nov 27, 2020 at 06:41:10PM +0100, Philippe Mathieu-Daudé wrote:
> We lately realized that the Avocado framework was not designed
> to be regularly run on CI environments. Therefore, as of 5.2
Hi Phil,

First of all, let me say that I understand your overall goal, and although I don't agree with the strategy, I believe we're in agreement wrt the destination.

The main issue that you seem to address here is the fact that some CI tests may fail more often than others, which will lead to jobs that fail more than others, which will ultimately taint the overall CI status. Does that sound like an appropriate overall grasp of your motivation?

Assuming I got it right, let me say that having an "always green CI" is a noble goal, but it can also be extremely frustrating if one doesn't apply some safeguarding measures. More on that later.

As you certainly know, I'm also interested in understanding the "not designed to be regularly run on CI environments" part. The best correlation I could make was to link that to these two points you raised elsewhere:

 1) Failing randomly
 2) Images hardcoded in tests are being removed from public servers

With regards to point #1, this is probably unavoidable as a whole. I've had some experience running dedicated test jobs for close to a decade, and maybe the only way to get close to avoiding random failures on integration tests is to run close to nothing in those jobs. Running "/bin/true" has a very low chance of failing randomly.

In my own experience, the only way to address point #1 is to babysit jobs. That means:

 a) assume they will produce some messy stuff at no particular time
 b) act as quickly and effectively as possible
 c) be compassionate, that is, waive the unavoidable mess incidents

Building on the previous analogy, if you decide to not have a baby but a plant, you'll probably need to do a lot less of those. If you get a pet, then some more. Now a human baby will probably (I guess) require a whole lot more of those. And as those age and reach maturity, they'll (hopefully) require less babysitting, but they can still mess up at any given time.
Analogies and jokes aside, the urgent *action item* here has been discussed both publicly and internally at Red Hat. It consists of having an "always on" maintainer for those jobs. In the specific case of the "Acceptance" jobs, Willian Rampazzo has volunteered to, initially, be this person. He'll manage all related information on the jobs' issues.

We're still discussing the tools to use to give the visibility that the QEMU project needs. I personally would be happy enough to start with a publicly accessible spreadsheet that builds upon the information produced by GitLab. A proper application is also being considered. A sample of the requirements includes:

 I) waive failures (say a job and tests failed because of a network outage)
 II) build trends (show how stable all executions of test "foo" were during the last: week, month, eternity, etc.)
 III) keep a list of the known issues and relate them to waivers and currently skipped tests

Getting back to point #2, I have two main points about it. First, I've had a lot of experience with tests having copies of images, both on local filesystems and on nearby NFS servers. Local filesystems would fail at provisioning/sync time, and NFS-based ones would fail every now and then for various network or NFS server issues. It goes back to my point about never being able to escape the babysitting.

Second, this is somehow related to features and improvements that could/should be added to whatever supporting code (aka framework) we use. Right now, we have some specific features scheduled to be introduced in Avocado 84.0 (due in ~2 weeks). They can be seen with the "customer:QEMU" label:

https://github.com/avocado-framework/avocado/issues?q=is%3Aissue+is%3Aopen+label%3Acustomer%3AQEMU

A number of other features have already landed in previous versions, but I was unable to send patches in time for 5.2, so my expectation is to bundle more of them and bump Avocado to 84.0 at once (instead of 82.0 or 83.0).
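To make requirements I-III above concrete, here is a minimal, purely illustrative sketch of the data such a tool would track. All names (TestExecution, History, the waiver reason) are hypothetical; this is not an actual application, just plain Python showing the three requirements as operations on a run history:

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class TestExecution:
    test: str
    passed: bool
    waived: bool = False       # requirement I: waive known-bad failures
    reason: str = ""

@dataclass
class History:
    executions: list = field(default_factory=list)

    def waive(self, execution, reason):
        # requirement I: mark a failure as waived, e.g. a network outage
        execution.waived = True
        execution.reason = reason

    def stability(self, test):
        # requirement II: trend = pass rate, counting waived runs as passes
        runs = [e for e in self.executions if e.test == test]
        return mean(1.0 if (e.passed or e.waived) else 0.0 for e in runs)

    def known_issues(self):
        # requirement III: map waived failures back to their reasons
        return {e.test: e.reason for e in self.executions if e.waived}

history = History()
history.executions += [
    TestExecution("boot_linux", True),
    TestExecution("boot_linux", False),
]
history.waive(history.executions[1], "network outage on 2020-11-27")
print(history.stability("boot_linux"))   # 1.0 after the waiver
print(history.known_issues())
```

The point of the sketch is only that waivers, trends and known issues are three views over the same execution records, which is why a single spreadsheet (or small application) fed by GitLab data could cover all three.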
> we deprecate the gitlab-ci jobs using Avocado. To not disrupt
> current users, it is possible to keep the current behavior by
> setting the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE variable
> (see [*]).
> From now on, using these jobs (or adding new tests to them)
> is strongly discouraged.

These jobs run `make check-acceptance`, which will pick up new tests. So how do you suggest to *not add new tests* to those jobs? Are you suggesting that no new acceptance test be added?

> Tests based on Avocado will be ported to new job schemes during
> the next releases, with better documentation and templates.

This is a very good approach to move forward. For what it's worth, Avocado has invested in an API specifically for that, the Job API. The goal is to have smarter jobs for different purposes that behave appropriately and account for the environment (such as host platform, CI, etc.). Example of my upcoming "job-kvm-only.py":

------------------------------------------------------------------------------
#!/usr/bin/env python
import os
import sys

from qemu.accel import kvm_available
from avocado.core.job import Job


def main():
    if not kvm_available():
        sys.exit(0)
    config = {'run.references': ['tests/acceptance/'],
              'filter.by_tags.tags': ['accel:kvm,arch:%s' % os.uname()[4]]}
    with Job.from_config(config) as job:
        return job.run()


if __name__ == '__main__':
    sys.exit(main())
------------------------------------------------------------------------------

Other examples of jobs using this API can be seen here:

https://github.com/avocado-framework/avocado/tree/master/examples/jobs

And the documentation on the features one can use by setting configuration keys can be found here:

https://avocado-framework.readthedocs.io/en/83.0/config/index.html

So, for example, if one wants to ignore errors while fetching assets in a job, there is this:

https://avocado-framework.readthedocs.io/en/83.0/config/index.html#assets-fetch-ignore-errors

> [*]
> https://docs.gitlab.com/ee/ci/variables/README.html#create-a-custom-variable-in-the-ui
>
> Signed-off-by: Philippe Mathieu-Daudé <phi...@redhat.com>
> ---
>  .gitlab-ci.yml | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
> index d0173e82b16..2674407cd13 100644
> --- a/.gitlab-ci.yml
> +++ b/.gitlab-ci.yml
> @@ -66,6 +66,15 @@ include:
>      - cd build
>      - python3 -c 'import json; r = json.load(open("tests/results/latest/results.json")); [print(t["logfile"]) for t in r["tests"] if t["status"] not in ("PASS", "SKIP", "CANCEL")]' | xargs cat
>      - du -chs ${CI_PROJECT_DIR}/avocado-cache
> +  rules:
> +    # As of QEMU 5.2, Avocado is not yet ready to run in CI environments, therefore
> +    # the jobs based on this template are not run automatically (except if the user
> +    # explicitly sets the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE environment
> +    # variable). Adding new jobs on top of this template is strongly discouraged.
> +    - if: $QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE == null
> +      when: manual
> +      allow_failure: true
> +    - when: always

I believe the best way to move forward is a bit different from what you propose here. I'd go with the babysitting approach, and aggressively disable tests at the first sign of failure, instead of "muting" all of them at once. My perception is that without the babysitting and quick actions, new jobs will end up in the same situation once enough tests are added.

Anyway, thanks for bringing up the discussion here.

- Cleber.

> build-system-ubuntu:
>   <<: *native_build_job_definition
> --
> 2.26.2
>
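P.S.: a footnote on "aggressively disable tests at the first sign of failure": in practice this can be a one-line skip decorator on the flaky test, keeping it in the tree (and tracked as a known issue) instead of muting the whole job. A minimal, self-contained sketch using the stdlib unittest module; Avocado tests can be waived the same way with its own skip decorator, and the test names and waiver reason below are made up:

```python
import unittest

class BootConsole(unittest.TestCase):
    # The flaky test is disabled, not deleted: the skip reason should
    # point at the known-issues list so the waiver can be reverted later.
    @unittest.skip("waived: flaky in CI (hypothetical known-issues entry)")
    def test_serial_console(self):
        self.fail("would drive the guest serial console here")

    def test_smoke(self):
        # The remaining tests keep running and keep the job green.
        self.assertTrue(True)

suite = unittest.TestLoader().loadTestsFromTestCase(BootConsole)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(len(result.skipped), len(result.failures))  # 1 skipped, 0 failures
```

The job stays green while the skipped test remains visible in the results, which is exactly the signal a waiver-tracking tool needs.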