Bug#1011533: chrony: kdevops reboot-limit fails between 50-12,000 reboots
On Tue, May 24, 2022 at 10:19:31AM -0700, Luis Chamberlain wrote:
> On Tue, May 24, 2022 at 05:07:56PM +, Luis Chamberlain wrote:
> > kernel         | reboots | with-fix
> > ---------------+---------+------------------------
> > v5.10.105      | 500     | not-tested-yet
> > v5.17-rc7      | 1,200   | 2,000+
> > 5.17.0-1-amd64 | 3,300+  | first-run-still-running
>
> Cc'ing Amir so he's aware. The subject is off, it should be between
> 500-1,200 reboots.
>
> Each kdevops reboot-limit digit bump on the .kernel-ci.ok file
> after running 'make reboot-limit-baseline-loop' represents
> 100 tests run, so just multiply the number in the file by 100.
>
> Note that if you don't enable CONFIG_WORKFLOW_LINUX_CUSTOM=y
> you will just use the default distro kernel and there seem to be
> no failures there yet.

The patch I proposed could probably be simplified by using systemd logic
for dependency matching. What that is, is not clear, but clearly there is
a race which the hack seems to resolve.

  Luis
Bug#1011533: chrony: kdevops reboot-limit fails between 50-12,000 reboots
On Tue, May 24, 2022 at 05:07:56PM +, Luis Chamberlain wrote:
> kernel         | reboots | with-fix
> ---------------+---------+------------------------
> v5.10.105      | 500     | not-tested-yet
> v5.17-rc7      | 1,200   | 2,000+
> 5.17.0-1-amd64 | 3,300+  | first-run-still-running

Cc'ing Amir so he's aware. The subject is off, it should be between
500-1,200 reboots.

Each kdevops reboot-limit digit bump on the .kernel-ci.ok file
after running 'make reboot-limit-baseline-loop' represents
100 tests run, so just multiply the number in the file by 100.

Note that if you don't enable CONFIG_WORKFLOW_LINUX_CUSTOM=y
you will just use the default distro kernel and there seem to be
no failures there yet.

  Luis
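As a small illustration of the arithmetic above, here is a sketch in POSIX sh. The helper name reboots_from_counter is hypothetical, and it assumes the .kernel-ci.ok file holds a single integer, which the thread implies but does not spell out.

```shell
# Hypothetical helper, not part of kdevops: convert the counter kept in
# .kernel-ci.ok into a number of completed reboots.
reboots_from_counter() {
    # $1: integer counter value; each bump represents 100 reboots
    echo $(( $1 * 100 ))
}

# Possible usage, assuming the file contains just the counter:
#   reboots_from_counter "$(cat .kernel-ci.ok)"
```

With a counter of 12, for example, the helper reports 1200 reboots, which matches the v5.17-rc7 row in the table.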
Bug#1011533: chrony: kdevops reboot-limit fails between 50-12,000 reboots
Package: chrony
Version: 4.2-2
Severity: important
Tags: patch
X-Debbugs-Cc: mcg...@kernel.org

Dear Maintainer,

When using the new kdevops [0] reboot-limit [1] test to see how many
reboots can happen with debian-testing without a failure, I ran 3 tests
with different kernels, with the following observations. The point of the
test is simply to instantiate vagrant debian-testing guests, then reboot
them and detect with ansible whether ssh access to the guest is possible.
The test fails upon an ssh timeout or crash.

In the table below a + indicates the test is still running. A plain
number expresses how many reboots completed successfully.

kernel         | reboots | with-fix
---------------+---------+------------------------
v5.10.105      | 500     | not-tested-yet
v5.17-rc7      | 1,200   | 2,000+
5.17.0-1-amd64 | 3,300+  | first-run-still-running

Upon inspection of the failed boots on v5.10.105 and v5.17-rc7 I noticed
the following on both systems:

root@rebootlimit ~ # sudo systemctl list-units --failed
  UNIT              LOAD   ACTIVE SUB    DESCRIPTION
● ifup@eth0.service loaded failed failed ifup for eth0

I can then see (scraped from a console, sorry about the formatting):

]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
]: DHCPOFFER of 192.168.121.240 from 192.168.121.1
]: DHCPREQUEST for 192.168.121.240 on eth0 to 255.255.255.255 port 67
]: DHCPACK of 192.168.121.240 from 192.168.121.1
]: bound to 192.168.121.240 -- renewal in 1699 seconds.
nd to 192.168.121.240 -- renewal in 1699 seconds.
-parts: /etc/network/if-up.d/chrony exited with return code 1
p: failed to bring up eth0
ifup@eth0.service: Main process exited, code=exited, status=1/FAILURE
ifup@eth0.service: Failed with result 'exit-code'.

The important line is:

May 21 10:58:58 rebootlimit sh[693]: run-parts: /etc/network/if-up.d/chrony exited with return code 1

Using $(virsh net-dhcp-leases vagrant-libvirt) I see no other takers of
the IP address, so there have been no clashes. So my next best guess,
given the lack of output from chrony, is that this is a race on bootup.
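Since the hook's chronyc call discards its output, the log shows only "return code 1". One way to learn more while chasing the race is to capture that output. The following is a debugging sketch only: run_logged is a hypothetical helper, not part of the chrony package, and it is not the fix proposed in this report.

```shell
# Hypothetical debugging helper for /etc/network/if-up.d/chrony.
run_logged() {
    # Run the given command, capturing stdout and stderr together.
    output=$("$@" 2>&1)
    status=$?
    if [ "$status" -ne 0 ]; then
        # Report the failure on stderr instead of discarding it.
        echo "if-up.d/chrony: '$*' exited with $status: $output" >&2
    fi
    return "$status"
}

# In the hook this could stand in for the silenced call (assumption):
#   run_logged chronyc onoffline
```

The helper returns the command's own exit status, so the hook would still fail as before; the only difference is that the reason ends up in the journal next to the run-parts message.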
I'm still testing things but the following adjustment seems to have
helped so far.

--- /etc/network/if-up.d/chrony.old 2022-05-24 16:40:53.112439882 +
+++ /etc/network/if-up.d/chrony 2022-05-24 16:41:23.452471796 +
@@ -5,6 +5,7 @@
 [ -x /usr/sbin/chronyd ] || exit 0

 if [ -e /run/chrony/chronyd.pid ]; then
+    systemctl is-system-running --wait
     chronyc onoffline > /dev/null 2>&1
 fi

[0] https://github.com/linux-kdevops/kdevops
[1] https://github.com/linux-kdevops/kdevops/blob/master/workflows/demos/reboot-limit/Kconfig

-- System Information:
Debian Release: bookworm/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.105 (SMP w/8 CPU threads)
Kernel taint flags: TAINT_UNSIGNED_MODULE
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages chrony depends on:
ii  adduser              3.121
ii  init-system-helpers  1.62
ii  iproute2             5.17.0-2
ii  libc6                2.33-7
ii  libcap2              1:2.44-1
ii  libedit2             3.1-20210910-1
ii  libgnutls30          3.7.4-2
ii  libnettle8           3.7.3-1
ii  libseccomp2          2.5.4-1
ii  tzdata               2022a-1
ii  ucf                  3.0043

chrony