Thomas Huth <th...@redhat.com> writes:
> On 23/09/2022 09.28, Daniel P. Berrangé wrote:
>> On Thu, Sep 22, 2022 at 03:04:12PM -0400, Stefan Hajnoczi wrote:
>>> QEMU's avocado and Travis s390x check-tcg CI jobs fail often and I don't
>>> know why. I think it's due to timeouts but maybe there is something
>>> buried in the logs that I missed.
>>>
>>> I waste time skimming through logs when merging qemu.git pull requests
>>> and electricity is wasted on tests that don't produce useful pass/fail
>>> output.
>>>
>>> Here are two recent examples:
>>> https://gitlab.com/qemu-project/qemu/-/jobs/3070754718
>>> https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/583629583
>>>
>>> If there are real test failures then the test output needs to be
>>> improved so people can identify failures.
>>>
>>> If the tests are timing out then they need to be split up and/or reduced
>>> in duration. BTW, if it's a timeout, why are we using an internal
>>> timeout instead of letting CI mark the job as timed out?
>>>
>>> Any other ideas for improving these CI jobs?
>>
>> The avocado job there does show the errors, but the summary at the
>> end leaves something to be desired. At first glance it looked like
>> everything passed because it says "ERROR 0" and that's what caught
>> my eye. It took a long time to notice that the 'INTERRUPT 5' bit is
>> actually just an error state too. I don't understand why it has to
>> have so many different ways of saying the same thing:
>>
>>   RESULTS : PASS 14 | ERROR 0 | FAIL 0 | SKIP 37 | WARN 0 |
>>   INTERRUPT 5 | CANCEL 136
>>
>> "ERROR", "FAIL" and "INTERRUPT" are all just the same thing.
>> "SKIP" and "CANCEL" are just the same thing.
>>
>> I'm sure there was some reason for these different terms, but IMHO
>> they are actively unhelpful.
>>
>> For example I see no justifiable reason for the choice of SKIP vs
>> CANCEL in these two messages:
>>
>>   (173/192) tests/avocado/virtiofs_submounts.py:VirtiofsSubmountsTest.test_pre_launch_set_up:
>>   SKIP: sudo -n required, but "sudo -n true" failed: [Errno 2] No such
>>   file or directory: 'sudo'
>>
>>   (183/192) tests/avocado/x86_cpu_model_versions.py:X86CPUModelAliases.test_4_1_alias:
>>   CANCEL: No QEMU binary defined or found in the build tree (0.00 s)
>>
>> It would be clearer to understand the summary as:
>>
>>   RESULTS: PASS 14 | ERROR 5 | SKIP 173 | WARN 0
>>
>> I'd also like to see it repeat the error messages for the failed
>> tests at the end, so you don't have to search back up through the
>> huge log to find them.
>>
>> On the TCG tests we see
>>
>>   timeout --foreground 90 /home/travis/build/qemu-project/qemu/build/qemu-s390x noexec > noexec.out
>>   make[1]: *** [../Makefile.target:158: run-noexec] Error 1
>>   make[1]: Leaving directory '/home/travis/build/qemu-project/qemu/build/tests/tcg/s390x-linux-user'
>>   make: *** [/home/travis/build/qemu-project/qemu/tests/Makefile.include:60: run-tcg-tests-s390x-linux-user] Error 2
>>
>> I presume that indicates the 'noexec' test failed, but we have zero
>> info.
>
> I think this is the bug that will be fixed by Ilya's patch here:
>
> https://lists.gnu.org/archive/html/qemu-devel/2022-09/msg02756.html
>
> But I agree, it is unfortunate that the output is not available.
> Looking at this on my s390x box:
>
>   $ cat tests/tcg/s390x-linux-user/noexec.out
>   [ RUN ] fallthrough
>   [ OK ]
>   [ RUN ] jump
>   [ FAILED ] unexpected SEGV
>
> so there is an indication of what's going wrong in there indeed.
>
> Alex, would it be possible to change the tcg test harness to dump the
> .out file of failing tests?
Yes I think so, either by tweaking the run-% rules in
tests/tcg/Makefile.target to handle a failed call or possibly by
expanding the run-test macro itself.

>
>  Thomas

--
Alex Bennée
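
For reference, a minimal sketch of the kind of change discussed above,
written as a run-% pattern rule that dumps the captured output when a
test fails. The TEST_RUNNER variable and the per-test .out naming are
placeholders for illustration, not the actual definitions in
tests/tcg/Makefile.target:

# Sketch only: run the test, capture stdout/stderr into <test>.out, and
# on a non-zero exit status print the captured output so it shows up in
# the CI log before the rule fails.
run-%: %
	@$(TEST_RUNNER) $< > $<.out 2>&1 || \
		(echo "FAIL: $< (output follows)"; cat $<.out; exit 1)

The same idea could instead be folded into the existing run-test macro
so every caller gets the failure dump without touching each rule.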