Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)

Stefan Weil via Wed, 24 Apr 2024 09:29:09 -0700

On 20.04.24 at 22:25, Stefan Weil wrote:

On 16.04.24 at 14:17, Stefan Weil wrote:

On 16.04.24 at 14:10, Peter Maydell wrote:

The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.

[...]

Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track
down whatever this is?


I'll have a look.


Short summary:

The "persistent intermittent failures due to jobs timing out" are not related to TCI: they also occur if the same tests are run with the normal TCG. I suggest that the CI tests should run single threaded.


Hi Paolo,

I need help from someone who knows the CI and the build and test framework better.

Peter reported intermittent timeouts for the cross-i686-tci job, causing it to fail. I can reproduce such timeouts locally, but noticed that they are not limited to TCI. The GitLab CI also shows other examples, such as this job:


https://gitlab.com/qemu-project/qemu/-/jobs/6700955287

I think the timeouts are caused by running too many parallel processes during testing.


The CI uses parallel builds:

make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

It looks like `nproc` returns 8, so make runs with 9 parallel jobs.
`meson test` uses the same value to run 9 test processes in parallel:

/builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs

Since the host can only handle 8 parallel threads, 9 threads might already cause some tests to run non-deterministically.

But if some of the individual tests also use multithreading (according to my tests they do so with at least 3 or 4 threads), things get even worse. Then there are up to 4 * 9 = 36 threads competing to run on the available 8 cores.


In this scenario timeouts are expected and can occur randomly.
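The arithmetic above can be written out as a small shell sketch. The function name is made up for illustration, and the figure of 4 threads per test is an assumption taken from the upper end of the "3 or 4" observed locally; real tests may use fewer or more.

```shell
#!/bin/sh
# Worst-case number of runnable threads: every test process runs at once,
# and each one uses the assumed number of threads.
worst_case_threads() {
    # $1 = parallel test processes, $2 = threads per test (assumed)
    echo $(( $1 * $2 ))
}

cores=8                      # value of nproc reported above
test_procs=$((cores + 1))    # the CI uses nproc + 1
threads_per_test=4           # assumption: upper end of the observed 3 or 4

total=$(worst_case_threads "$test_procs" "$threads_per_test")
echo "$total threads competing for $cores cores"
```

With these numbers the worst case is more than four runnable threads per core, so a test that needs a steady share of CPU time can easily exceed its timeout.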

In my tests setting --num-processes to a lower value not only avoided timeouts but also reduced the processing overhead without increasing the runtime.


Could we run all tests with `--num-processes 1`?
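If a single process turns out to be too conservative, one possible middle ground is to derive the process count from the core count and an assumed number of threads per test. This is only a sketch and not something the CI does today: the helper name and the threads-per-test value are hypothetical, and the script only prints the resulting command, using the same meson flags that appear in the CI log above.

```shell
#!/bin/sh
# Hypothetical helper: largest process count for which
# processes * threads_per_test does not exceed the core count (minimum 1).
bounded_num_processes() {
    # $1 = available cores, $2 = threads per test (assumed)
    n=$(( $1 / $2 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

cores=8               # value of nproc reported above
threads_per_test=4    # assumption: upper end of the observed 3 or 4
num=$(bounded_num_processes "$cores" "$threads_per_test")

# Only print the command; the flags are the ones shown in the CI log.
echo "meson test --no-rebuild -t 1 --num-processes $num --print-errorlogs"
```

For 8 cores and 4 threads per test this yields 2 processes. Whether that is enough to avoid the timeouts would have to be measured.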

Thanks,
Stefan
