Bug#1004107: meson: flaky autopkgtest on armhf: dictionary changed size during iteration -> timeout
On Thu, 5 May 2022 at 22:39, Paul Gevers wrote:

> It just occurred to me that it may be useful to try and reduce the
> number of concurrent running tests to something you would expect on a
> more normal computer (under conditions where the framework is better
> tested). Our armel host has 160 cores, similar, our amd64 ci-worker13
> host has 56.

No harm in trying I guess:

https://github.com/mesonbuild/meson/pull/10358
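The suggestion quoted above, limiting the number of concurrent tests on very large hosts, could be sketched roughly as follows. This is only an illustration of the idea and not the change made in the linked pull request; the function name `capped_worker_count` and the default cap of 16 are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def capped_worker_count(cpu_count=None, cap=16):
    """Return a worker count that never exceeds `cap` (a hypothetical limit)."""
    if cpu_count is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    # Always keep at least one worker, never more than the cap.
    return max(1, min(cpu_count, cap))


if __name__ == "__main__":
    workers = capped_worker_count()
    print(f"Running tests with {workers} workers")
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Trivial stand-in jobs; real test jobs would go here.
        results = list(executor.map(lambda n: n * n, range(8)))
    print(results)
```

On a 160-core host this sketch would start 16 workers instead of 160, while a 4-core machine would still use all 4.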
Bug#1004107: meson: flaky autopkgtest on armhf: dictionary changed size during iteration -> timeout
Hi Jussi,

On 21-01-2022 19:17, Paul Gevers wrote:
> Running tests with 160 workers

It just occurred to me that it may be useful to try and reduce the number of concurrent running tests to something you would expect on a more normal computer (under conditions where the framework is better tested). Our armel host has 160 cores, similar, our amd64 ci-worker13 host has 56.

Paul

https://sources.debian.org/src/meson/0.62.1-1/run_project_tests.py/#L1542
https://sources.debian.org/src/meson/0.62.1-1/run_project_tests.py/#L1552
Bug#1004107: meson: flaky autopkgtest on armhf: dictionary changed size during iteration -> timeout
Hi Jussi,

On 21-01-2022 00:05, Jussi Pakkanen wrote:
> On Thu, 20 Jan 2022 at 23:33, Paul Gevers wrote:
>> I looked at the results of the autopkgtest of your package on armhf
>> because it was showing up as a regression for the upload of
>> python-defaults and setuptools. I noticed that the test regularly fails,
>> what's worse, it also seems to hang as the test is killed because it
>> hits an autopkgtest timeout.
>
> If we look at the backtrace:
>
>> Running tests with 160 workers
>> Exception in thread Thread-1:
>> Traceback (most recent call last):
>>   File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
>>     self.run()
>>   File "/usr/lib/python3.9/concurrent/futures/process.py", line 317, in run
>>     result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
>>   File "/usr/lib/python3.9/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
>>     worker_sentinels = [p.sentinel for p in self.processes.values()]
>>   File "/usr/lib/python3.9/concurrent/futures/process.py", line 376, in <listcomp>
>>     worker_sentinels = [p.sentinel for p in self.processes.values()]
>> RuntimeError: dictionary changed size during iteration
>
> This is all Python's internal code. Further, Meson does not do any
> fancy threading stuff itself, it uses Python's thread and process
> executors to queue up a bunch of work and then wait for it to be done.
> According to Python's documentation you don't need to do any locking
> or similar to submit new work, you can call the submit method
> directly. All of this would seem to indicate that the issue might lie
> somewhere in Python's multithreading code. At the very least I have no
> idea how I should go about debugging that issue.

Reading the log, it's not clear to me if that backtrace is even related to the hang. I mean, in my (non-Python related) experience, if something goes wrong in a parallel process, you can get logs from parallel pieces that fail without them really causing problems for the actual processing. Of course, it's still true that it's difficult to troubleshoot.

Paul
Bug#1004107: meson: flaky autopkgtest on armhf: dictionary changed size during iteration -> timeout
On Thu, 20 Jan 2022 at 23:33, Paul Gevers wrote:

> I looked at the results of the autopkgtest of your package on armhf
> because it was showing up as a regression for the upload of
> python-defaults and setuptools. I noticed that the test regularly fails,
> what's worse, it also seems to hang as the test is killed because it
> hits an autopkgtest timeout.

If we look at the backtrace:

> Running tests with 160 workers
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
>     self.run()
>   File "/usr/lib/python3.9/concurrent/futures/process.py", line 317, in run
>     result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
>   File "/usr/lib/python3.9/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
>     worker_sentinels = [p.sentinel for p in self.processes.values()]
>   File "/usr/lib/python3.9/concurrent/futures/process.py", line 376, in <listcomp>
>     worker_sentinels = [p.sentinel for p in self.processes.values()]
> RuntimeError: dictionary changed size during iteration

This is all Python's internal code. Further, Meson does not do any fancy threading stuff itself; it uses Python's thread and process executors to queue up a bunch of work and then wait for it to be done. According to Python's documentation you don't need to do any locking or similar to submit new work; you can call the submit method directly. All of this would seem to indicate that the issue might lie somewhere in Python's multithreading code. At the very least I have no idea how I should go about debugging that issue.
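For readers unfamiliar with the pattern described above (queue up work on an executor, then wait for it), here is a minimal sketch. It is not Meson's actual code: `run_one` and `run_all` are hypothetical names, and a thread executor is used as the default so the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_one(test_id):
    """Stand-in for running a single project test (hypothetical)."""
    return test_id, "ok"


def run_all(test_ids, max_workers, executor_cls=ThreadPoolExecutor):
    """Submit every job up front, then collect results as they finish."""
    results = {}
    with executor_cls(max_workers=max_workers) as executor:
        # submit() is called directly, with no explicit locking around it.
        futures = [executor.submit(run_one, t) for t in test_ids]
        for future in as_completed(futures):
            test_id, status = future.result()
            results[test_id] = status
    return results


if __name__ == "__main__":
    print(run_all(range(5), max_workers=4))
```

`ProcessPoolExecutor` exposes the same `submit` interface, so it could be passed as `executor_cls`, provided the job function is importable at module level so it can be pickled.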
Bug#1004107: meson: flaky autopkgtest on armhf: dictionary changed size during iteration -> timeout
Source: meson
Version: 0.56.2-1
Severity: serious
X-Debbugs-CC: debian...@lists.debian.org
User: debian...@lists.debian.org
Usertags: flaky timeout

Dear maintainer(s),

I looked at the results of the autopkgtest of your package on armhf because it was showing up as a regression for the upload of python-defaults and setuptools. I noticed that the test regularly fails; what's worse, it also seems to hang, as the test is killed because it hits an autopkgtest timeout.

Because the unstable-to-testing migration software now blocks on regressions in testing, flaky tests, i.e. tests that flip between passing and failing without changes to the list of installed packages, are causing people unrelated to your package to spend time on these tests. In this case, Release Team members had to investigate if curl was OK to go into the next Stable point release.

Don't hesitate to reach out if you need help and some more information from our infrastructure.

Please note that the host we run our armhf tests on is very powerful. It has 160 cores and 255 GB RAM. This is sometimes the root cause of tests that fail. It seems that before we switched to this host, the test was more reliable.

Paul

https://ci.debian.net/packages/m/meson/testing/amd64/

E.g.
https://ci.debian.net/data/autopkgtest/testing/armhf/m/meson/18519155/log.gz

Ran 462 tests in 569.524s

OK (skipped=66)

Meson build system 0.61.0 Unit Tests
pytest-xdist not found, using unittest instead
Total time: 569.540 seconds
Meson build system 0.61.0 Project Tests
Using python 3.9.9 (main, Jan 12 2022, 16:10:51)

host machine compilers

c : [gcc] cc (gcc 11.2.0 "cc (Debian 11.2.0-13) 11.2.0")
cpp: [gcc] c++ (gcc 11.2.0 "c++ (Debian 11.2.0-13) 11.2.0")
cs : [mono] mcs (mono 6.8.0.105)
cuda : [not found]
cython : [not found]
d : [llvm] ldc2 (llvm 1.28.0 "LDC - the LLVM D compiler (1.28.0):")
fortran: [gcc] gfortran (gcc 11.2.0 "GNU Fortran (Debian 11.2.0-13) 11.2.0")
java : [unknown] javac (unknown 11.0.13)
objc : [gcc] cc (gcc 11.2.0)
objcpp : [gcc] c++ (gcc 11.2.0)
rust : [rustc] rustc -C linker=cc (rustc 1.56.0)
swift : [not found]
vala : [valac] valac (valac 0.54.6)

tools

ninja : /usr/bin/ninja (1.10.1)
cmake : /usr/bin/cmake (3.22.1)
hotdoc : not found

Checking that configuring works...
Checking that introspect works...
Checking that building works...
Checking that testing works...
Checking that installing works...
Running tests with 160 workers
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/concurrent/futures/process.py", line 317, in run
    result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
  File "/usr/lib/python3.9/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
    worker_sentinels = [p.sentinel for p in self.processes.values()]
  File "/usr/lib/python3.9/concurrent/futures/process.py", line 376, in <listcomp>
    worker_sentinels = [p.sentinel for p in self.processes.values()]
RuntimeError: dictionary changed size during iteration
Running cmake tests.
autopkgtest [16:04:29]: ERROR: timed out on command "su -s /bin/bash debci -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true; . ~/.profile >/dev/null 2>&1 || true; buildtree="/tmp/autopkgtest-lxc.f3gr65px/downtmp/build.IRe/src"; mkdir -p -m 1777 -- "/tmp/autopkgtest-lxc.f3gr65px/downtmp/exhaustive-artifacts"; export AUTOPKGTEST_ARTIFACTS="/tmp/autopkgtest-lxc.f3gr65px/downtmp/exhaustive-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest-lxc.f3gr65px/downtmp/autopkgtest_tmp"; export AUTOPKGTEST_TMP="/tmp/autopkgtest-lxc.f3gr65px/downtmp/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=160; unset LANGUAGE LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pid; set -C; echo $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; chmod +x /tmp/autopkgtest-lxc.f3gr65px/downtmp/build.IRe/src/debian/tests/exhaustive; touch /tmp/autopkgtest-lxc.f3gr65px/downtmp/exhaustive-stdout /tmp/autopkgtest-lxc.f3gr65px/downtmp/exhaustive-stderr; /tmp/autopkgtest-lxc.f3gr65px/downtmp/build.IRe/src/debian/tests/exhaustive 2> >(tee -a /tmp/autopkgtest-lxc.f3gr65px/downtmp/exhaustive-stderr >&2) > >(tee -a /tmp/autopkgtest-lxc.f3gr65px/downtmp/exhaustive-stdout);" (kind: test)
autopkgtest [16:04:30]: test exhaustive: ---]
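
For context on the error message in this log: Python raises this RuntimeError when a dict gains or loses keys while it is being iterated. The single-threaded sketch below only reproduces the message; it is not a reproduction of whatever happens inside concurrent.futures on the CI host, and the function names and fake sentinel values are made up for illustration.

```python
def sentinels_unsafe(processes):
    """Iterate over the dict while adding keys, which triggers the error."""
    collected = []
    for pid, sentinel in processes.items():
        collected.append(sentinel)
        # Stand-in for another thread registering a new worker mid-iteration.
        processes[pid + 1000] = sentinel
    return collected


def sentinels_snapshot(processes):
    """Iterate over a list copy, so later insertions do not disturb the loop."""
    collected = []
    for pid, sentinel in list(processes.items()):
        collected.append(sentinel)
        processes[pid + 1000] = sentinel
    return collected


if __name__ == "__main__":
    try:
        sentinels_unsafe({1: 10, 2: 20})
    except RuntimeError as exc:
        print(f"RuntimeError: {exc}")
    print(sentinels_snapshot({1: 10, 2: 20}))
```

Taking a snapshot avoids the error in this toy case; whether a similar approach applies to the executor's internal process table is a question for the Python developers.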