Bug#1011533: chrony: kdevops reboot-limit fails between 50-12,000 reboots

2022-05-24 Thread Luis Chamberlain
On Tue, May 24, 2022 at 10:19:31AM -0700, Luis Chamberlain wrote:
> On Tue, May 24, 2022 at 05:07:56PM +, Luis Chamberlain wrote:
> > kernel         | reboots | with-fix
> > ---------------+---------+------------------------
> > v5.10.105      | 500     | not-tested-yet
> > v5.17-rc7      | 1,200   | 2,000+
> > 5.17.0-1-amd64 | 3,300+  | first-run-still-running
> 
> Cc'ing Amir so he's aware. The subject is off, it should be between
> 500-1,200 reboots.
> 
> Each kdevops reboot-limit digit bump on the .kernel-ci.ok file
> after running 'make reboot-limit-baseline-loop' represents
> 100 tests run, so just multiply the number in the file by 100.
> 
> Note that if you don't enable CONFIG_WORKFLOW_LINUX_CUSTOM=y
> you will just use the default distro kernel, and there seem to be
> no failures there yet.

The patch I proposed could probably be simplified by using systemd's
dependency ordering instead. What the right dependency would be is not
clear yet, but there is clearly a race, and the hack seems to resolve it.

  Luis



Bug#1011533: chrony: kdevops reboot-limit fails between 50-12,000 reboots

2022-05-24 Thread Luis Chamberlain
On Tue, May 24, 2022 at 05:07:56PM +, Luis Chamberlain wrote:
> kernel         | reboots | with-fix
> ---------------+---------+------------------------
> v5.10.105      | 500     | not-tested-yet
> v5.17-rc7      | 1,200   | 2,000+
> 5.17.0-1-amd64 | 3,300+  | first-run-still-running

Cc'ing Amir so he's aware. The subject is off, it should be between
500-1,200 reboots.

Each kdevops reboot-limit digit bump on the .kernel-ci.ok file
after running 'make reboot-limit-baseline-loop' represents
100 tests run, so just multiply the number in the file by 100.

Note that if you don't enable CONFIG_WORKFLOW_LINUX_CUSTOM=y
you will just use the default distro kernel, and there seem to be
no failures there yet.
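
As a rough illustration of that arithmetic, the sketch below turns the
counter in the file into a reboot count. The function name
reboots_from_ok_file is hypothetical, and the sketch assumes the file holds
a single non-negative integer on its first line; it is not part of kdevops.

```shell
#!/bin/sh
# Hypothetical helper: print the number of reboots represented by the
# counter stored in a kdevops .kernel-ci.ok file (counter * 100).
# Assumption: the first line of the file is a plain decimal integer.
reboots_from_ok_file() {
    loops=
    # read returns non-zero on a missing trailing newline, so also accept
    # the case where it still managed to fill in a value.
    read -r loops < "$1" || [ -n "$loops" ] || return 1
    case $loops in
        # Reject empty input, non-digits, and leading zeros (which the
        # shell arithmetic below would otherwise treat as octal).
        ''|*[!0-9]*|0[0-9]*) return 1 ;;
    esac
    echo $((loops * 100))
}
```

For example, a file containing 12 would correspond to the 1,200 reboots
listed for v5.17-rc7.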

  Luis



Bug#1011533: chrony: kdevops reboot-limit fails between 50-12,000 reboots

2022-05-24 Thread Luis Chamberlain
Package: chrony
Version: 4.2-2
Severity: important
Tags: patch
X-Debbugs-Cc: mcg...@kernel.org

Dear Maintainer,

When using the new kdevops [0] reboot-limit [1] test to see how many reboots
can happen with debian-testing without a failure, I ran 3 tests with
different kernels, with the following observations. The point of the
test is simply to instantiate vagrant debian-testing guests, then
reboot them and detect with ansible whether ssh access to the guest is
possible. The test fails upon an ssh timeout or crash. In the table below
a + indicates the test is still running. Each number expresses how many
reboots completed successfully.
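
The shape of such a test can be sketched as a simple loop. This is only an
illustration, not the kdevops implementation (which drives the guests with
ansible); the function name and its arguments are hypothetical, and the
reboot and probe commands are passed in by the caller.

```shell
#!/bin/sh
# Hypothetical sketch: count_reboots_until_failure MAX REBOOT_CMD PROBE_CMD
# Runs REBOOT_CMD followed by PROBE_CMD up to MAX times and prints how many
# iterations completed before the first failure of either command.
count_reboots_until_failure() {
    max=$1
    reboot_cmd=$2
    probe_cmd=$3
    ok=0
    while [ "$ok" -lt "$max" ]; do
        # Unquoted on purpose so a command with arguments can be passed.
        $reboot_cmd || break
        $probe_cmd || break
        ok=$((ok + 1))
    done
    echo "$ok"
}
```

The probe would typically be something with its own timeout, so that a
guest that never comes back counts as a failure rather than hanging the
loop.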

kernel         | reboots | with-fix
---------------+---------+------------------------
v5.10.105      | 500     | not-tested-yet
v5.17-rc7      | 1,200   | 2,000+
5.17.0-1-amd64 | 3,300+  | first-run-still-running

Upon inspection on the failed boots on v5.10.105 and v5.17-rc7 I
noticed the following on both systems:

root@rebootlimit ~ # sudo systemctl list-units --failed
  UNIT              LOAD   ACTIVE SUB    DESCRIPTION
● ifup@eth0.service loaded failed failed ifup for eth0

I can see then (scraped from a console, sorry about formatting):

]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
]: DHCPOFFER of 192.168.121.240 from 192.168.121.1
]: DHCPREQUEST for 192.168.121.240 on eth0 to 255.255.255.255 port 67
]: DHCPACK of 192.168.121.240 from 192.168.121.1
]: bound to 192.168.121.240 -- renewal in 1699 seconds.
nd to 192.168.121.240 -- renewal in 1699 seconds.
-parts: /etc/network/if-up.d/chrony exited with return code 1
p: failed to bring up eth0
ifup@eth0.service: Main process exited, code=exited, status=1/FAILURE
ifup@eth0.service: Failed with result 'exit-code'.

The important line is:

May 21 10:58:58 rebootlimit sh[693]: run-parts: /etc/network/if-up.d/chrony exited with return code 1
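
To pick such lines out of a longer log, something like the following filter
could be used. It is a minimal sketch: the function name failed_hooks is
hypothetical, and it assumes the log lines keep the "run-parts: PATH exited
with return code N" wording seen above.

```shell
#!/bin/sh
# Hypothetical filter: read log lines on stdin and print
# "<hook path> <return code>" for every run-parts failure line.
failed_hooks() {
    sed -n 's/.*run-parts: \(.*\) exited with return code \([0-9][0-9]*\).*/\1 \2/p'
}
```

Assuming the messages end up in the journal, it could be fed with
`journalctl -b -u ifup@eth0.service | failed_hooks`.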

Using $(virsh net-dhcp-leases vagrant-libvirt) I see no takers of the IP
address, so there have been no clashes. So my next best guess, given
the lack of output from chrony, is that this is a race on bootup.
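
A check along those lines could be scripted roughly as follows. This is a
sketch under assumptions: lease_count is a hypothetical helper, and it
assumes the lease table prints addresses as whitespace-separated fields of
the form ADDRESS/PREFIX.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin that mention the given IP
# address as a whole field, either bare or followed by "/PREFIX".
lease_count() {
    awk -v ip="$1" '
        {
            for (i = 1; i <= NF; i++)
                if ($i == ip || index($i, ip "/") == 1) { n++; break }
        }
        END { print n + 0 }'
}
```

It would be used as
`virsh net-dhcp-leases vagrant-libvirt | lease_count 192.168.121.240`;
a count above 1 would point at more than one lease for the same address.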

I'm still testing things but the following adjustment seems to have
helped so far.

--- /etc/network/if-up.d/chrony.old 2022-05-24 16:40:53.112439882 +
+++ /etc/network/if-up.d/chrony 2022-05-24 16:41:23.452471796 +
@@ -5,6 +5,7 @@
 [ -x /usr/sbin/chronyd ] || exit 0
 
 if [ -e /run/chrony/chronyd.pid ]; then
+systemctl is-system-running --wait
 chronyc onoffline > /dev/null 2>&1
 fi
 

[0] https://github.com/linux-kdevops/kdevops
[1] https://github.com/linux-kdevops/kdevops/blob/master/workflows/demos/reboot-limit/Kconfig
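
One possible variation on the change in the diff above, sketched here only
as an assumption and not tested against the race: bound the wait with
timeout(1) and ignore its exit status, since
'systemctl is-system-running --wait' returns non-zero whenever the system
ends up in a state other than fully running. The variable names, the
30 second default, and the notify_chrony function are all hypothetical.

```shell
#!/bin/sh
# Hypothetical variant of the hook body. The commands are overridable so
# the logic can be exercised without systemd or chrony installed.
WAIT_SECS=${WAIT_SECS:-30}
WAIT_CMD=${WAIT_CMD:-"systemctl is-system-running --wait"}
NOTIFY_CMD=${NOTIFY_CMD:-"chronyc onoffline"}
PID_FILE=${PID_FILE:-/run/chrony/chronyd.pid}

notify_chrony() {
    # Nothing to do if chronyd is not running.
    [ -e "$PID_FILE" ] || return 0
    # Wait for boot to settle, but never longer than WAIT_SECS, and do not
    # let a "degraded" result or a timeout fail the hook.
    timeout "$WAIT_SECS" $WAIT_CMD > /dev/null 2>&1 || true
    # The hook's result is the result of the notification itself.
    $NOTIFY_CMD > /dev/null 2>&1
}
```

Whether a bounded wait still closes the race would need the same
reboot-limit runs as the original hack.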

-- System Information:
Debian Release: bookworm/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.105 (SMP w/8 CPU threads)
Kernel taint flags: TAINT_UNSIGNED_MODULE
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages chrony depends on:
ii  adduser              3.121
ii  init-system-helpers  1.62
ii  iproute2             5.17.0-2
ii  libc6                2.33-7
ii  libcap2              1:2.44-1
ii  libedit2             3.1-20210910-1
ii  libgnutls30          3.7.4-2
ii  libnettle8           3.7.3-1
ii  libseccomp2          2.5.4-1
ii  tzdata               2022a-1
ii  ucf                  3.0043

chrony