On pseries LPAR systems in a high-availability environment using the SBD[1][2] service, I observed that the system abruptly rebooted before dump capture could complete.
Further investigation showed that SBD had configured a watchdog with a 30-second timeout. Since the kernel crashes directly into the kdump kernel without shutting down userspace services, the watchdog remained active during dump capture. Once the watchdog timeout expired, PHYP reset the LPAR, causing dump capture to fail. The issue was reproducible only when the watchdog was active. Dump capture completed successfully after disabling the watchdog, stopping the SBD service, or increasing the watchdog timeout value. This patch fixes the issue by stopping all active watchdogs on the crash shutdown path before booting the kdump kernel. Driver that export the hardware watchdog device is: drivers/watchdog/pseries-wdt.c [1] https://github.com/clusterlabs/sbd/blob/main/man/sbd.8.pod.in [2] https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/cha-ha-storage-protect.html This issue can be reproduce using below program: #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> #include <signal.h> #include <errno.h> #include <sys/ioctl.h> #include <linux/watchdog.h> #define WATCHDOG_DEV "/dev/watchdog" #define TIMEOUT 10 #define PET_INTERVAL 1 static int wdt_fd = -1; static void watchdog_close(int disarm) { int flags; if (wdt_fd < 0) return; if (disarm) { flags = WDIOS_DISABLECARD; if (ioctl(wdt_fd, WDIOC_SETOPTIONS, &flags) < 0) printf("WDIOS_DISABLECARD failed: %m (nowayout may be set)\n"); else printf("Watchdog disabled via WDIOS_DISABLECARD\n"); if (write(wdt_fd, "V", 1) < 0) printf("Magic 'V' write failed: %m\n"); else printf("Magic 'V' written\n"); } else { printf("Closing WITHOUT disarming - watchdog keeps running!\n"); } close(wdt_fd); wdt_fd = -1; printf("Watchdog fd closed\n"); } static void safe_exit(int sig) { printf("\nSignal %d received - disarming watchdog...\n", sig); watchdog_close(1); exit(0); } static int watchdog_init(void) { int flags, timeout = TIMEOUT; struct watchdog_info ident; printf("Opening %s...\n", WATCHDOG_DEV); wdt_fd = open(WATCHDOG_DEV, O_WRONLY); if (wdt_fd < 0) { printf("Failed to open %s: %m\n", WATCHDOG_DEV); return -1; } printf("Watchdog opened and ARMED\n"); flags = WDIOS_ENABLECARD; if (ioctl(wdt_fd, WDIOC_SETOPTIONS, &flags) < 0) /* ENOTTY = driver always enabled, that's fine */ printf("WDIOS_ENABLECARD: %m (ok if ENOTTY)\n"); else printf("Watchdog enabled via WDIOS_ENABLECARD\n"); if (ioctl(wdt_fd, WDIOC_SETTIMEOUT, &timeout) < 0) printf("WDIOC_SETTIMEOUT failed: %m\n"); else printf("Timeout set to %d seconds\n", timeout); /* verify what the driver actually set */ if (ioctl(wdt_fd, WDIOC_GETTIMEOUT, &timeout) == 0) printf("Actual timeout : %d seconds\n", timeout); if (ioctl(wdt_fd, WDIOC_GETSUPPORT, &ident) == 0) printf("Identity : %s\n", ident.identity); return 0; } static void watchdog_tickle(void) { int timeleft = 0; if (ioctl(wdt_fd, WDIOC_KEEPALIVE, 0) < 0) { printf("WDIOC_KEEPALIVE failed: %m - falling back to write\n"); write(wdt_fd, "1", 1); } if (ioctl(wdt_fd, WDIOC_GETTIMELEFT, &timeleft) == 0) printf("Petted watchdog. Timeleft: %d sec\n", timeleft); else printf("Petted watchdog.\n"); } int main(void) { signal(SIGINT, safe_exit); signal(SIGTERM, safe_exit); if (watchdog_init() < 0) return 1; printf("\nPetting every %d seconds. Ctrl+C to safely stop.\n\n", PET_INTERVAL); while (1) { watchdog_tickle(); sleep(PET_INTERVAL); } return 0; } Steps to reproduce the issue: ----------------------------- 1. Insert pseries-wdt driver 2. Compile the above proram and run the binary 3. Crash the kernel Sourabh Jain (1): powerpc/crash: stop watchdogs before booting kdump kernel arch/powerpc/kexec/crash.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) -- 2.52.0
