Bug#932851: marked as done (systemd causes diskless nodes to stop working / require hard reset)

Debian Bug Tracking System Thu, 25 Jul 2019 04:57:22 -0700

Your message dated Thu, 25 Jul 2019 13:55:28 +0200
with message-id <[email protected]>
and subject line Re: Bug#932851: systemd causes diskless nodes to stop working / require hard reset
has caused the Debian Bug report #932851,
regarding systemd causes diskless nodes to stop working / require hard reset
to be marked as done.


This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
932851: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=932851
Debian Bug Tracking System
Contact [email protected] with problems

--- Begin Message ---

Package: systemd
Version: 241-5

Ever since the switch to systemd we have been experiencing significant problems 
with our diskless nodes, where if the NFS connection is dropped for any reason 
(NFS server reboot, network router state reset, etc.) there is a high chance 
the diskless nodes will enter an unrecoverable state and require a hard reset 
(power off, power on).

While we've been working around this for a while and assumed it was just a 
Debian quirk, I was able to obtain the following trace from the console of a 
hung system today:

[820689.313769] nfs: server 192.168.1.1 not responding, still trying
[820693.530338] nfs: server 192.168.1.1 not responding, still trying
[820693.530451] nfs: server 192.168.1.1 not responding, still trying
[820696.994677] nfs: server 192.168.1.1 not responding, still trying
[820697.218891] nfs: server 192.168.1.1 not responding, still trying
[820697.698918] nfs: server 192.168.1.1 not responding, still trying
[820698.106834] nfs: server 192.168.1.1 not responding, still trying
[820721.177609] nfs: server 192.168.1.1 not responding, still trying
[820725.466102] nfs: server 192.168.1.1 not responding, still trying
[820818.681006] watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [systemd-logind:273]
[820932.960202] INFO: task openvpn:5096 blocked for more than 120 seconds.
[820937.889046] nfs: server 192.168.1.1 OK
[820937.889226] nfs: server 192.168.1.1 OK
[820937.889374] nfs: server 192.168.1.1 OK
[820937.889381] nfs: server 192.168.1.1 OK
[820937.889448] nfs: server 192.168.1.1 OK
[820937.889503] nfs: server 192.168.1.1 OK
[820937.889574] nfs: server 192.168.1.1 OK
[820937.889665] nfs: server 192.168.1.1 OK
[820937.889670] nfs: server 192.168.1.1 OK
[820937.889674] nfs: server 192.168.1.1 OK
[820937.903880] systemd-journald[171]: Failed to open system journal: Permission denied
[820938.083071] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
[820938.111157] systemd[1]: systemd-journald.service: Failed to kill control group /system.slice/systemd-journald.service, ignoring: Permission denied
[820938.124774] systemd[1]: systemd-journald.service: Failed to kill control group /system.slice/systemd-journald.service, ignoring: Permission denied
[820938.131244] systemd[1]: systemd-journald.service: Unit entered failed state.
[820938.131418] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[820938.144754] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=6/ABRT
[820938.170807] systemd[1]: systemd-udevd.service: Failed to kill control group /system.slice/systemd-udevd.service, ignoring: Permission denied
[820938.177666] systemd[1]: systemd-udevd.service: Unit entered failed state.
[820938.177798] systemd[1]: systemd-udevd.service: Failed with result 'watchdog'.
[820938.189036] systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.

This fairly clearly puts the blame somewhere in systemd, which makes sense as 
our older non-systemd machines recover perfectly fine from even extended NFS 
server failures.  At minimum the systemd watchdog should probably be disabled 
while the NFS server is unavailable.

--- End Message ---

--- Begin Message ---

On 24.07.19 at 00:01, Timothy Pearson wrote:
> Package: systemd
> Version: 241-5
> 
> Ever since the switch to systemd we have been experiencing significant 
> problems with our diskless nodes, where if the NFS connection is dropped for 
> any reason (NFS server reboot, network router state reset, etc.) there is a 
> high chance the diskless nodes will enter an unrecoverable state and require 
> a hard reset (power off, power on).
> 
> While we've been working around this for a while and assumed it was just a 
> Debian quirk, I was able to obtain the following trace from the console of a 
> hung system today:
> 
> [820689.313769] nfs: server 192.168.1.1 not responding, still trying
> [820693.530338] nfs: server 192.168.1.1 not responding, still trying
> [820693.530451] nfs: server 192.168.1.1 not responding, still trying
> [820696.994677] nfs: server 192.168.1.1 not responding, still trying
> [820697.218891] nfs: server 192.168.1.1 not responding, still trying
> [820697.698918] nfs: server 192.168.1.1 not responding, still trying
> [820698.106834] nfs: server 192.168.1.1 not responding, still trying
> [820721.177609] nfs: server 192.168.1.1 not responding, still trying
> [820725.466102] nfs: server 192.168.1.1 not responding, still trying
> [820818.681006] watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [systemd-logind:273]
> [820932.960202] INFO: task openvpn:5096 blocked for more than 120 seconds.
> [820937.889046] nfs: server 192.168.1.1 OK
> [820937.889226] nfs: server 192.168.1.1 OK
> [820937.889374] nfs: server 192.168.1.1 OK
> [820937.889381] nfs: server 192.168.1.1 OK
> [820937.889448] nfs: server 192.168.1.1 OK
> [820937.889503] nfs: server 192.168.1.1 OK
> [820937.889574] nfs: server 192.168.1.1 OK
> [820937.889665] nfs: server 192.168.1.1 OK
> [820937.889670] nfs: server 192.168.1.1 OK
> [820937.889674] nfs: server 192.168.1.1 OK
> [820937.903880] systemd-journald[171]: Failed to open system journal: Permission denied
> [820938.083071] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
> [820938.111157] systemd[1]: systemd-journald.service: Failed to kill control group /system.slice/systemd-journald.service, ignoring: Permission denied
> [820938.124774] systemd[1]: systemd-journald.service: Failed to kill control group /system.slice/systemd-journald.service, ignoring: Permission denied
> [820938.131244] systemd[1]: systemd-journald.service: Unit entered failed state.
> [820938.131418] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
> [820938.144754] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=6/ABRT
> [820938.170807] systemd[1]: systemd-udevd.service: Failed to kill control group /system.slice/systemd-udevd.service, ignoring: Permission denied
> [820938.177666] systemd[1]: systemd-udevd.service: Unit entered failed state.
> [820938.177798] systemd[1]: systemd-udevd.service: Failed with result 'watchdog'.
> [820938.189036] systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.
> 
> This fairly clearly puts the blame somewhere in systemd, which makes sense as 
> our older non-systemd machines recover perfectly fine from even extended NFS 
> server failures.  At minimum the systemd watchdog should probably be disabled 
> while the NFS server is unavailable.

From your log it seems that other processes get stuck as well.
For your specific case, I would recommend that you disable the Watchdog
feature for the affected services or adjust the timeout to your needs.
See
$ grep Watchdog /lib/systemd/system/systemd-*
/lib/systemd/system/systemd-hostnamed.service:WatchdogSec=3min
/lib/systemd/system/systemd-journald.service:WatchdogSec=3min
/lib/systemd/system/systemd-localed.service:WatchdogSec=3min
/lib/systemd/system/systemd-logind.service:WatchdogSec=3min
/lib/systemd/system/systemd-networkd.service:WatchdogSec=3min
/lib/systemd/system/systemd-resolved.service:WatchdogSec=3min
/lib/systemd/system/systemd-timedated.service:WatchdogSec=3min
/lib/systemd/system/systemd-timesyncd.service:WatchdogSec=3min
/lib/systemd/system/systemd-udevd.service:WatchdogSec=3min

To disable/modify the watchdog, I would recommend adding a drop-in
config file. You can easily do that via e.g.
systemctl edit systemd-udevd.service

Then either use
  [Service]
  WatchdogSec=0

(0 disables the Watchdog)

or
  [Service]
  WatchdogSec=600

(wait for up to 10min)

See man systemd.service.
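
If several services need the same change, the drop-in can also be
written non-interactively. The following is only a rough sketch, not a
tested recipe: the function name write_watchdog_dropin, the DROPIN_ROOT
variable and the file name watchdog.conf are made up for this example.
It assumes drop-ins are read from /etc/systemd/system/<unit>.d/, which
is where systemctl edit places its override file.

```shell
#!/bin/sh
# Sketch: write a WatchdogSec= drop-in for each listed unit.
# DROPIN_ROOT and write_watchdog_dropin are hypothetical names used only
# in this example; they are not part of systemd.
DROPIN_ROOT="${DROPIN_ROOT:-/etc/systemd/system}"

write_watchdog_dropin() {
    # $1 = watchdog timeout (0 disables the watchdog)
    # remaining arguments = unit names, e.g. systemd-udevd.service
    timeout="$1"
    shift
    for unit in "$@"; do
        dir="$DROPIN_ROOT/$unit.d"
        mkdir -p "$dir" || return 1
        # Same two lines one would enter in "systemctl edit"
        printf '[Service]\nWatchdogSec=%s\n' "$timeout" \
            > "$dir/watchdog.conf" || return 1
    done
}
```

For example, running
  write_watchdog_dropin 0 systemd-journald.service systemd-udevd.service
as root, followed by systemctl daemon-reload and a restart of the
affected services, should have the same effect as editing each unit by
hand.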

I don't see how systemd could reliably detect that the cause of a hanging
service is an unavailable NFS server, and I'm not convinced systemd
should try to be clever here.

Regards,
Michael


--- End Message ---

_______________________________________________
Pkg-systemd-maintainers mailing list
[email protected]
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
