On 20.04.24 at 22:25, Stefan Weil wrote:
On 16.04.24 at 14:17, Stefan Weil wrote:
On 16.04.24 at 14:10, Peter Maydell wrote:

The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
[...]
Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track
down whatever this is?

I'll have a look.

Short summary:

The "persistent intermittent failures due to jobs timing out" are not related to TCI: they also occur if the same tests are run with the normal TCG. I suggest that the CI tests should run single-threaded.

Hi Paolo,

I need help from someone who knows the CI and the build and test framework better.

Peter reported intermittent timeouts for the cross-i686-tci job, causing it to fail. I can reproduce such timeouts locally, but noticed that they are not limited to TCI. The GitLab CI also shows other examples, such as this job:

https://gitlab.com/qemu-project/qemu/-/jobs/6700955287

I think the timeouts are caused by running too many parallel processes during testing.

The CI uses parallel builds:

make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

It looks like `nproc` returns 8, so make runs with 9 parallel jobs.
`meson test` uses the same value to run 9 test processes in parallel:

/builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs

Since the host can only run 8 threads in parallel, 9 test processes might already make the run times of some tests unpredictable.

But if some of the individual tests also use multithreading (according to my tests they do so with at least 3 or 4 threads), things get even worse. Then there are up to 4 * 9 = 36 threads competing to run on the available 8 cores.

In this scenario timeouts are expected and can occur randomly.
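
The arithmetic above can be checked with a small shell sketch. The function name `estimate_load` is hypothetical, and the bound of 4 threads per test is only the estimate from my local runs; the `nproc + 1` rule is the one from the CI command line.

```shell
#!/bin/sh
# Hypothetical helper: estimate the worst-case number of runnable threads.
#   $1 = hardware threads on the host (8 on the CI runner discussed here)
#   $2 = assumed upper bound of threads per test (3 or 4 observed locally)
estimate_load() {
    cores=$1
    threads_per_test=$2
    procs=$((cores + 1))                  # same rule as make -j$(expr $(nproc) + 1)
    total=$((procs * threads_per_test))   # every test process may spawn this many threads
    echo "$procs processes, up to $total threads on $cores cores"
}

estimate_load 8 4
# prints: 9 processes, up to 36 threads on 8 cores
```

On another machine, `estimate_load "$(nproc)" 4` gives the corresponding worst case for that host.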

In my tests, setting `--num-processes` to a lower value not only avoided timeouts but also reduced the processing overhead without increasing the runtime.
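
One way to pick such a lower value is to divide the available cores by the expected threads per test. This is only a sketch under assumptions: `test_procs` is a hypothetical helper, the 4-threads-per-test bound is an estimate, and the CI does not do this today. The `meson test` options are the ones from the CI log above; the command is printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: choose a --num-processes value that keeps
# (processes * threads per test) within the number of cores.
#   $1 = hardware threads on the host
#   $2 = assumed upper bound of threads per test
test_procs() {
    n=$(($1 / $2))
    if [ "$n" -lt 1 ]; then
        n=1                               # always run at least one test process
    fi
    echo "$n"
}

# Print (do not run) the resulting command for an 8-core host.
echo "meson test --no-rebuild -t 1 --num-processes $(test_procs 8 4) --print-errorlogs"
```

For 8 cores and 4 threads per test this yields `--num-processes 2`; passing 1 instead is the most conservative choice.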

Could we run all tests with `--num-processes 1`?

Thanks,
Stefan

