On 20.04.24 at 22:25, Stefan Weil wrote:
On 16.04.24 at 14:17, Stefan Weil wrote:
On 16.04.24 at 14:10, Peter Maydell wrote:
The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
[...]
Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.
Can somebody who cares about TCI investigate, please, and track
down whatever this is?
I'll have a look.
Short summary:
The "persistent intermittent failures due to jobs timing out" are not
related to TCI: they also occur if the same tests are run with the
normal TCG. I suggest that the CI tests should run single threaded.
Hi Paolo,
I need help from someone who knows the CI and the build and test
framework better.
Peter reported intermittent timeouts for the cross-i686-tci job, causing
it to fail. I can reproduce such timeouts locally, but noticed that they
are not limited to TCI. The GitLab CI also shows other examples, such as
this job:
https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
I think the timeouts are caused by running too many parallel processes
during testing.
The CI uses parallel builds:
make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
It looks like `nproc` returns 8, so make runs up to 9 jobs in parallel.
`meson test` uses the same value to run 9 test processes in parallel:
/builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Since the host can only run 8 threads in parallel, 9 test processes might
already make the timing of some tests non-deterministic.
But if some of the individual tests also use multithreading (according
to my tests they do so with at least 3 or 4 threads), things get even
worse. Then there are up to 4 * 9 = 36 threads competing to run on the
available 8 cores.
In this scenario timeouts are expected and can occur randomly.
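
The arithmetic above can be reproduced with a small sketch. The helper name
`worst_case_threads` is hypothetical, and the inputs (8 cores, up to 4 threads
per test) are the values observed in this message, not measured constants:

```shell
# Hypothetical helper: worst-case number of threads competing for the CPU
# when every test process uses the given number of threads.
worst_case_threads() {
    cores=$1
    threads_per_test=$2
    procs=$((cores + 1))    # same rule as `expr $(nproc) + 1` in the CI command
    echo $((procs * threads_per_test))
}

# Assumed values from this message: 8 cores, up to 4 threads per test.
worst_case_threads 8 4    # prints 36
```

With 36 threads on 8 cores, each core would have to serve more than 4 threads
in the worst case.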
In my tests setting --num-processes to a lower value not only avoided
timeouts but also reduced the processing overhead without increasing the
runtime.
Could we run all tests with `--num-processes 1`?
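
For anyone who wants to repeat the comparison, here is a minimal sketch that
only prints the command lines to try. The helper name `meson_test_cmd` is
hypothetical, the options are copied from the CI log quoted above, the
intermediate value 4 is an arbitrary choice, and it is assumed that `meson` is
on the PATH (the CI uses the copy in build/pyvenv/bin):

```shell
# Hypothetical helper: build the meson command line for a given process count.
# All options except the process count are taken from the CI log.
meson_test_cmd() {
    echo "meson test --no-rebuild -t 1 --num-processes $1 --print-errorlogs"
}

# Print the candidate command lines (9 is the current CI value, 1 the proposal).
for n in 9 4 1; do
    meson_test_cmd "$n"
done
```

Running each printed command under `time` from the build directory should give
a rough comparison of wall-clock and CPU time.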
Thanks,
Stefan