Thomas Huth <th...@redhat.com> writes:
> On 23/09/2022 09.28, Daniel P. Berrangé wrote:
>> On Thu, Sep 22, 2022 at 03:04:12PM -0400, Stefan Hajnoczi wrote:
>>> QEMU's avocado and Travis s390x check-tcg CI jobs fail often and I don't
>>> know why. I think it's due to timeouts but maybe there is something
>>> buried in the logs that I missed.
>>>
>>> I waste time skimming through logs when merging qemu.git pull requests
>>> and electricity is wasted on tests that don't produce useful pass/fail
>>> output.
>>>
>>> Here are two recent examples:
>>> https://gitlab.com/qemu-project/qemu/-/jobs/3070754718
>>> https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/583629583
>>>
>>> If there are real test failures then the test output needs to be
>>> improved so people can identify failures.
>>>
>>> If the tests are timing out then they need to be split up and/or reduced
>>> in duration. BTW, if it's a timeout, why are we using an internal
>>> timeout instead of letting CI mark the job as timed out?
>>>
>>> Any other ideas for improving these CI jobs?
>>
>> The avocado job there does show the errors, but the summary at the
>> end leaves something to be desired. At first glance it looked like
>> everything passed because it says "ERROR 0" and that's what caught
>> my eye. It took a long time to notice that the 'INTERRUPT 5' bit is
>> actually just an error state too. I don't understand why it has to
>> have so many different ways of saying the same thing:
>>
>>   RESULTS : PASS 14 | ERROR 0 | FAIL 0 | SKIP 37 | WARN 0 |
>>   INTERRUPT 5 | CANCEL 136
>>
>> "ERROR", "FAIL" and "INTERRUPT" are all just the same thing.
>> "SKIP" and "CANCEL" are just the same thing.
>>
>> I'm sure there was some reason for these different terms, but IMHO
>> they are actively unhelpful.
>>
>> For example I see no justifiable reason for the choice of SKIP vs
>> CANCEL in these two messages:
>>
>>   (173/192) tests/avocado/virtiofs_submounts.py:VirtiofsSubmountsTest.test_pre_launch_set_up:
>>   SKIP: sudo -n required, but "sudo -n true" failed: [Errno 2] No such
>>   file or directory: 'sudo'
>>
>>   (183/192) tests/avocado/x86_cpu_model_versions.py:X86CPUModelAliases.test_4_1_alias:
>>   CANCEL: No QEMU binary defined or found in the build tree (0.00 s)
>>
>> It would be clearer to understand the summary as:
>>
>>   RESULTS: PASS 14 | ERROR 5 | SKIP 173 | WARN 0
>>
>> I'd also like to see it repeat the error messages for the failed
>> tests at the end, so you don't have to search back up through the
>> huge log to find them.
>>
>> On the TCG tests we see
>>
>>   timeout --foreground 90 /home/travis/build/qemu-project/qemu/build/qemu-s390x noexec > noexec.out
>>   make[1]: *** [../Makefile.target:158: run-noexec] Error 1
>>   make[1]: Leaving directory '/home/travis/build/qemu-project/qemu/build/tests/tcg/s390x-linux-user'
>>   make: *** [/home/travis/build/qemu-project/qemu/tests/Makefile.include:60: run-tcg-tests-s390x-linux-user] Error 2
>>
>> I presume that indicates the 'noexec' test failed, but we have zero
>> info.
>
> I think this is the bug that will be fixed by Ilya's patch here:
>
> https://lists.gnu.org/archive/html/qemu-devel/2022-09/msg02756.html
>
> But I agree, it is unfortunate that the output is not available.
> Looking at this on my s390x box:
>
>   $ cat tests/tcg/s390x-linux-user/noexec.out
>   [ RUN ] fallthrough
>   [ OK ]
>   [ RUN ] jump
>   [ FAILED ] unexpected SEGV
>
> so there is an indication of what's going wrong in there indeed.
>
> Alex, would it be possible to change the tcg test harness to dump the
> .out file of failing tests?
Yes I think so, either by tweaking the run-% rules in
tests/tcg/Makefile.target to handle a failed call or possibly by
expanding the run-test macro itself.

>
>  Thomas

--
Alex Bennée
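
For reference, a minimal sketch of the kind of change discussed above,
written as a run-% pattern rule that dumps the captured output when a
test fails. The TEST_RUNNER variable and the per-test .out naming are
placeholders for illustration, not the actual definitions in
tests/tcg/Makefile.target:

# Sketch only: run the test, capture stdout/stderr into <test>.out, and
# on a non-zero exit status print the captured output so it shows up in
# the CI log before the rule fails.
run-%: %
	@$(TEST_RUNNER) $< > $<.out 2>&1 || \
		(echo "FAIL: $< (output follows)"; cat $<.out; exit 1)

The same idea could instead be folded into the existing run-test macro
so every caller gets the failure dump without touching each rule.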