On Fri, Nov 27, 2020 at 06:41:10PM +0100, Philippe Mathieu-Daudé wrote:
> We lately realized that the Avocado framework was not designed
> to be regularly run on CI environments. Therefore, as of 5.2
Hi Phil,

First of all, let me say that I understand your overall goal, and although I don't agree with the strategy, I believe we're in agreement wrt the destination.

The main issue that you seem to address here is the fact that some CI tests may fail more often than others, which will lead to jobs that fail more than others, which will ultimately taint the overall CI status. Does that sound like an appropriate overall grasp of your motivation?

Assuming I got it right, let me say that having an "always green CI" is a noble goal, but it can also be extremely frustrating if one doesn't apply some safeguarding measures. More on that later.

As you certainly know, I'm also interested in understanding the "not designed to be regularly run on CI environments" part. The best correlation I could make was to link that to these two points you raised elsewhere:

 1) Failing randomly
 2) Images hardcoded in tests are being removed from public servers

With regards to point #1, this is probably unavoidable as a whole. I've had some experience running dedicated test jobs for close to a decade, and maybe the only way to get close to avoiding random failures on integration tests is to run close to nothing in those jobs. Running "/bin/true" has a very low chance of failing randomly.

In my own experience, the only way to address point #1 is to babysit jobs. That means:

 a) assume they will produce some messy stuff at no particular time
 b) act as quickly and effectively as possible
 c) be compassionate, that is, waive the unavoidable mess incidents

Building on the previous analogy, if you decide to not have a baby but a plant, you'll probably need to do a lot less of those. If you get a pet, then some more. Now a human baby will probably (I guess) require a whole lot more of those. And as those age and reach maturity, they'll (hopefully) require less babysitting, but they can still mess up at any given time.
Analogies and jokes aside, the urgent *action item* here has been discussed both publicly and internally at Red Hat. It consists of having an "always on" maintainer for those jobs. In the specific case of the "Acceptance" jobs, Willian Rampazzo has volunteered to, initially, be this person. He'll manage all related information on the jobs' issues.

We're still discussing the tools to use to give the visibility that the QEMU project needs. I personally would be happy enough to start with a publicly accessible spreadsheet that builds upon the information produced by GitLab. A proper application is also being considered. A sample of the requirements includes:

 I) waive failures (say a job and tests failed because of a network outage)
 II) build trends (show how stable all executions of test "foo" were during the last: week, month, eternity, etc.)
 III) keep a list of the known issues and relate them to waivers and currently skipped tests

Getting back to point #2, I have two main points about it. First, I've had a lot of experience with tests having copies of images, both on local filesystems and on nearby NFS servers. Local filesystems would fail at provisioning/sync time, and NFS-based ones would fail every now and then for various network or NFS server issues. It goes back to my point about never being able to escape the babysitting.

Second, this is somehow related to features and improvements that could/should be added to whatever supporting code (aka framework) we use. Right now, we have some specific features scheduled to be introduced in Avocado 84.0 (due in ~2 weeks). They can be seen with the "customer:QEMU" label:

https://github.com/avocado-framework/avocado/issues?q=is%3Aissue+is%3Aopen+label%3Acustomer%3AQEMU

A number of other features have already landed in previous versions, but I was unable to send patches in time for 5.2, so my expectation is to bundle more of them and bump Avocado to 84.0 at once (instead of 82.0 or 83.0).
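To make requirements I-III above concrete, here is a minimal, purely illustrative sketch of the data such a tool would track. All names (TestExecution, History, the waiver reason) are hypothetical; this is not an actual application, just plain Python showing the three requirements as operations on a run history:

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class TestExecution:
    test: str
    passed: bool
    waived: bool = False       # requirement I: waive known-bad failures
    reason: str = ""

@dataclass
class History:
    executions: list = field(default_factory=list)

    def waive(self, execution, reason):
        # requirement I: mark a failure as waived, e.g. a network outage
        execution.waived = True
        execution.reason = reason

    def stability(self, test):
        # requirement II: trend = pass rate, counting waived runs as passes
        runs = [e for e in self.executions if e.test == test]
        return mean(1.0 if (e.passed or e.waived) else 0.0 for e in runs)

    def known_issues(self):
        # requirement III: map waived failures back to their reasons
        return {e.test: e.reason for e in self.executions if e.waived}

history = History()
history.executions += [
    TestExecution("boot_linux", True),
    TestExecution("boot_linux", False),
]
history.waive(history.executions[1], "network outage on 2020-11-27")
print(history.stability("boot_linux"))   # 1.0 after the waiver
print(history.known_issues())
```

The point of the sketch is only that waivers, trends and known issues are three views over the same execution records, which is why a single spreadsheet (or small application) fed by GitLab data could cover all three.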
> we deprecate the gitlab-ci jobs using Avocado. To not disrupt
> current users, it is possible to keep the current behavior by
> setting the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE variable
> (see [*]).
> From now on, using these jobs (or adding new tests to them)
> is strongly discouraged.

These jobs run `make check-acceptance`, which will pick up new tests. So how do you suggest to *not add new tests* to those jobs? Are you suggesting that no new acceptance test be added?

> Tests based on Avocado will be ported to new job schemes during
> the next releases, with better documentation and templates.

This is a very good approach to move forward. For what it's worth, Avocado has invested in an API specifically for that, the Job API. The goal is to have smarter jobs for different purposes that behave appropriately and account for the environment (such as host platform, CI, etc.). Example of my upcoming "job-kvm-only.py":

------------------------------------------------------------------------------
#!/usr/bin/env python
import os
import sys

from qemu.accel import kvm_available
from avocado.core.job import Job


def main():
    if not kvm_available():
        sys.exit(0)
    config = {'run.references': ['tests/acceptance/'],
              'filter.by_tags.tags': ['accel:kvm,arch:%s' % os.uname()[4]]}
    with Job.from_config(config) as job:
        return job.run()


if __name__ == '__main__':
    sys.exit(main())
------------------------------------------------------------------------------

Other examples of jobs using this API can be seen here:

https://github.com/avocado-framework/avocado/tree/master/examples/jobs

And the documentation on the features one can use by setting configuration keys can be found here:

https://avocado-framework.readthedocs.io/en/83.0/config/index.html

So, for example, if one wants to ignore errors while fetching assets in a job, there is this:

https://avocado-framework.readthedocs.io/en/83.0/config/index.html#assets-fetch-ignore-errors

> [*]
> https://docs.gitlab.com/ee/ci/variables/README.html#create-a-custom-variable-in-the-ui
>
> Signed-off-by: Philippe Mathieu-Daudé <phi...@redhat.com>
> ---
>  .gitlab-ci.yml | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
> index d0173e82b16..2674407cd13 100644
> --- a/.gitlab-ci.yml
> +++ b/.gitlab-ci.yml
> @@ -66,6 +66,15 @@ include:
>      - cd build
>      - python3 -c 'import json; r = json.load(open("tests/results/latest/results.json")); [print(t["logfile"]) for t in r["tests"] if t["status"] not in ("PASS", "SKIP", "CANCEL")]' | xargs cat
>      - du -chs ${CI_PROJECT_DIR}/avocado-cache
> +  rules:
> +    # As of QEMU 5.2, Avocado is not yet ready to run in CI environments, therefore
> +    # the jobs based on this template are not run automatically (except if the user
> +    # explicitly sets the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE environment
> +    # variable). Adding new jobs on top of this template is strongly discouraged.
> +    - if: $QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE == null
> +      when: manual
> +      allow_failure: true
> +    - when: always

I believe the best way to move forward is a bit different from what you propose here. I'd go with the babysitting approach, and aggressively disable tests at the first sign of failure, instead of "muting" all of them at once. My perception is that without the babysitting and quick actions, new jobs will end up in the same situation once enough tests are added.

Anyway, thanks for bringing up the discussion here.

- Cleber.

> build-system-ubuntu:
>   <<: *native_build_job_definition
> --
> 2.26.2
>
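P.S.: a footnote on "aggressively disable tests at the first sign of failure": in practice this can be a one-line skip decorator on the flaky test, keeping it in the tree (and tracked as a known issue) instead of muting the whole job. A minimal, self-contained sketch using the stdlib unittest module; Avocado tests can be waived the same way with its own skip decorator, and the test names and waiver reason below are made up:

```python
import unittest

class BootConsole(unittest.TestCase):
    # The flaky test is disabled, not deleted: the skip reason should
    # point at the known-issues list so the waiver can be reverted later.
    @unittest.skip("waived: flaky in CI (hypothetical known-issues entry)")
    def test_serial_console(self):
        self.fail("would drive the guest serial console here")

    def test_smoke(self):
        # The remaining tests keep running and keep the job green.
        self.assertTrue(True)

suite = unittest.TestLoader().loadTestsFromTestCase(BootConsole)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(len(result.skipped), len(result.failures))  # 1 skipped, 0 failures
```

The job stays green while the skipped test remains visible in the results, which is exactly the signal a waiver-tracking tool needs.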