Re: Automatic reboot on kernel crash in Debian 12 - how?

2024-04-16 Thread Max Nikulin

On 16/04/2024 16:17, Michael Kjörling wrote:

I have a handful of Debian 12 systems that I want to configure such
that they reboot automatically in case of a problem.

[...]

That leaves kernel-level issues.


I have not tried it, but I have seen some systemd options related to
configuring a hardware watchdog; see systemd.directives(7) and
systemd-system.conf(5).
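
For anyone who wants to experiment, here is a minimal, untested sketch of what such a configuration could look like. It assumes the RuntimeWatchdogSec= setting described in systemd-system.conf(5), a drop-in directory at /etc/systemd/system.conf.d/, and that the machine actually exposes a watchdog device. The file name 10-watchdog.conf and the 30 second timeout are arbitrary examples. The script only writes to a staging directory, so nothing under /etc is touched until you copy the file yourself.

```shell
#!/bin/sh
# Staging directory; this sketch never writes to /etc itself.
staging=$(mktemp -d)
dropin="$staging/10-watchdog.conf"   # hypothetical file name

# RuntimeWatchdogSec= asks systemd (PID 1) to arm the hardware watchdog
# and keep pinging it. If the pings stop, the hardware resets the machine.
cat > "$dropin" <<'EOF'
[Manager]
RuntimeWatchdogSec=30
EOF

cat "$dropin"
echo "To apply (as root): install -D -m 0644 $dropin /etc/systemd/system.conf.d/10-watchdog.conf"
echo "The setting should then be in effect after the next reboot."
```

Whether this helps on a given machine depends entirely on a working watchdog device being present; without one the setting should have no useful effect.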





Re: Automatic reboot on kernel crash in Debian 12 - how?

2024-04-16 Thread Franco Martelli

On 16/04/24 at 11:17, Michael Kjörling wrote:

Do I need to set some more settings to ensure that the system will
automatically reboot on a panic? If so, what?


Hi,

The Linux kernel source has two build options that make the kernel panic (and, together with a panic timeout, reboot) on a lockup:

config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"
depends on SOFTLOCKUP_DETECTOR
help
  Say Y here to enable the kernel to panic on "soft lockups",
  which are bugs that cause the kernel to loop in kernel
  mode for more than 20 seconds (configurable using the watchdog_thresh
  sysctl), without giving other tasks a chance to run.

  The panic can be used in combination with panic_timeout,
  to cause the system to reboot automatically after a
  lockup has been detected. This feature is useful for
  high-availability systems that have uptime guarantees and
  where a lockup must be resolved ASAP.

  Say N if unsure.

and:

config BOOTPARAM_HARDLOCKUP_PANIC
bool "Panic (Reboot) On Hard Lockups"
depends on HARDLOCKUP_DETECTOR
help
  Say Y here to enable the kernel to panic on "hard lockups",
  which are bugs that cause the kernel to loop in kernel
  mode with interrupts disabled for more than 10 seconds (configurable
  using the watchdog_thresh sysctl).

  Say N if unsure.

According to Documentation/admin-guide/kernel-parameters.txt, you can set
this either as a kernel parameter or via sysctl:


softlockup_panic=
[KNL] Should the soft-lockup detector generate panics.
Format: 0 | 1

A value of 1 instructs the soft-lockup detector
to panic the machine when a soft-lockup occurs. It is
also controlled by the kernel.softlockup_panic sysctl
and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the
respective build-time switch to that functionality.

The same applies to "kernel.hardlockup_panic", which does not seem to have
its own entry in that documentation file; I found it mentioned here:



nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
Format: [panic,][nopanic,][num]
Valid num: 0 or 1
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
When panic is specified, panic when an NMI watchdog
timeout occurs (or 'nopanic' to not panic on an NMI
watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set)

To disable both hard and soft lockup detectors,
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.

These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.

To learn more, I suggest installing the "linux-source-6.1" package and
looking at the "Watchdog" option, which is under "Device Drivers".
The BOOTPARAM_SOFTLOCKUP_PANIC and BOOTPARAM_HARDLOCKUP_PANIC options 
are under "Kernel hacking" → "Debug Oops, Lockups and Hangs".
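
Putting the sysctl route together, a sketch might look like the following. It assumes the running kernel was built with the soft and hard lockup detectors; otherwise the corresponding files under /proc/sys/kernel do not exist. The file name 90-lockup-panic.conf and the 120 second panic timeout are only examples. The script stages the fragment in a temporary directory and reports which keys the running kernel exposes; it changes no settings.

```shell
#!/bin/sh
# Staging directory; nothing is written to /etc/sysctl.d by this sketch.
staging=$(mktemp -d)
fragment="$staging/90-lockup-panic.conf"   # hypothetical file name

cat > "$fragment" <<'EOF'
# Reboot 120 seconds after a panic (example value).
kernel.panic = 120
# Turn detected lockups into panics so that the timeout above applies.
kernel.softlockup_panic = 1
kernel.hardlockup_panic = 1
EOF

# Report which of these keys the running kernel actually exposes.
for key in panic softlockup_panic hardlockup_panic; do
    if [ -e "/proc/sys/kernel/$key" ]; then
        echo "kernel.$key: available"
    else
        echo "kernel.$key: not available on this kernel"
    fi
done

echo "Staged in $fragment; copy it to /etc/sysctl.d/ and run 'sysctl --system' as root to apply."
```

On a virtual machine the hard lockup detector may well be unavailable, in which case kernel.hardlockup_panic will be reported as missing and that line should be dropped from the fragment.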


Cheers
--
Franco Martelli



Re: Automatic reboot on kernel crash in Debian 12 - how?

2024-04-16 Thread Michael Kjörling
On 16 Apr 2024 11:42 +0200, from geo...@nsup.org (Nicolas George):
>> Are you saying that the settings themselves are reasonable for the
>> purpose, and that this particular crash just happened to be such a one
>> that no software running on the system in question can reasonably help
>> with that scenario?
> 
> No, unfortunately I do not have the gift of divination, it would be
> convenient. I am saying that you cannot use software to protect yourself
> entirely from software bugs.

Well, naturally. But if there is some setting which I _could_ set that
would get me closer to my desired state, I would still like to know
which one and perhaps even what might be an appropriate value for it.

-- 
Michael Kjörling  https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”



Re: Automatic reboot on kernel crash in Debian 12 - how?

2024-04-16 Thread Nicolas George
Michael Kjörling (12024-04-16):
> Are you saying that the settings themselves are reasonable for the
> purpose, and that this particular crash just happened to be such a one
> that no software running on the system in question can reasonably help
> with that scenario?

No, unfortunately I do not have the gift of divination, it would be
convenient. I am saying that you cannot use software to protect yourself
entirely from software bugs.

> This happened on a VM that I can't directly influence the hardware
> configuration of (a commercially provided VPS), but I should be able
> to jury-rig something using the provider's API if necessary.

You probably can. But first check if your VM has an emulated hardware
watchdog.
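
One way to check, sketched under the assumption that a watchdog driver shows up in the usual places (/dev/watchdog* and, where the kernel provides it, /sys/class/watchdog/*/identity). The function name list_watchdogs is made up for this example; its optional argument is a root prefix so that it can be tried against a test tree.

```shell
#!/bin/sh
# list_watchdogs [ROOT]: print watchdog device nodes and driver identities
# found under ROOT (default: the real filesystem root).
# Returns 0 if anything was found, 1 otherwise.
list_watchdogs() {
    root=${1:-}
    rc=1
    for dev in "$root"/dev/watchdog*; do
        [ -e "$dev" ] || continue          # glob did not match anything
        echo "device: ${dev#"$root"}"
        rc=0
    done
    for id in "$root"/sys/class/watchdog/*/identity; do
        [ -r "$id" ] || continue
        echo "identity: $(cat "$id")"
        rc=0
    done
    return "$rc"
}

if list_watchdogs; then
    echo "A watchdog device appears to be present."
else
    echo "No watchdog device found."
fi
```

If nothing is found, it may still be worth asking the provider, since a watchdog device can exist on the virtual hardware without its driver being loaded.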

Regards,

-- 
  Nicolas George



Re: Automatic reboot on kernel crash in Debian 12 - how?

2024-04-16 Thread Michael Kjörling
On 16 Apr 2024 11:22 +0200, from geo...@nsup.org (Nicolas George):
>> Do I need to set some more settings to ensure that the system will
>> automatically reboot on a panic? If so, what?
> 
> If the crash was bad enough to freeze the kernel before it could
> trigger the reboot, there is nothing the software can do.
> 
> You need a hardware watchdog.

Are you saying that the settings themselves are reasonable for the
purpose, and that this particular crash just happened to be such a one
that no software running on the system in question can reasonably help
with that scenario?

This happened on a VM that I can't directly influence the hardware
configuration of (a commercially provided VPS), but I should be able
to jury-rig something using the provider's API if necessary.

-- 
Michael Kjörling  https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”



Re: Automatic reboot on kernel crash in Debian 12 - how?

2024-04-16 Thread Nicolas George
Michael Kjörling (12024-04-16):
> However, this morning I woke up to one of those systems showing a
> kernel crash dump and being frozen. Unfortunately the first part of
> the crash dump had scrolled past so I couldn't tell what class of
> problem caused the crash.
> 
> Do I need to set some more settings to ensure that the system will
> automatically reboot on a panic? If so, what?

If the crash was bad enough to freeze the kernel before it could
trigger the reboot, there is nothing the software can do.

You need a hardware watchdog. If your motherboard has one, just install
and enable the corresponding daemon, and check it works by SIGSTOPing
it.

If your motherboard does not have one, you can probably DIY one from a
RPi or an Arduino.

Regards,

-- 
  Nicolas George



Automatic reboot on kernel crash in Debian 12 - how?

2024-04-16 Thread Michael Kjörling
I have a handful of Debian 12 systems that I want to configure such
that they reboot automatically in case of a problem. I have set them
up with userspace scripts (executed through cron) to reboot if
something goes wrong there; that appears to work as expected if I
induce an issue that those scripts check for. That leaves kernel-level
issues.

To try to configure this, I have created a file
/etc/sysctl.d/local.conf (owned by root:root, mode 0644).

# cat /etc/sysctl.d/local.conf
kernel.panic = 120
kernel.panic_on_oops = 1
kernel.panic_on_stackoverflow = 1
kernel.panic_on_io_nmi = 1
#

With the exception of panic_on_stackoverflow, as far as I can tell
these are in effect after a reboot:

# sysctl kernel.panic kernel.panic_on_oops kernel.panic_on_stackoverflow kernel.panic_on_io_nmi
kernel.panic = 120
kernel.panic_on_oops = 1
sysctl: cannot stat /proc/sys/kernel/panic_on_stackoverflow: No such file or directory
kernel.panic_on_io_nmi = 1
#

However, this morning I woke up to one of those systems showing a
kernel crash dump and being frozen. Unfortunately the first part of
the crash dump had scrolled past so I couldn't tell what class of
problem caused the crash.

Do I need to set some more settings to ensure that the system will
automatically reboot on a panic? If so, what?

I know that the best option is not to crash; this is _in case of_.

-- 
Michael Kjörling  https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”