This issue was found with efi-pstore configured for kdump logging
and the NMI hard lockup detector enabled. Because efi-pstore writes
are slow, the pstore dump callback within kmsg_dump() took a long
time when there was a large number of logs.

This delay triggered the NMI watchdog, which caused a nested panic.
The call flow below shows how the nested panic called
emergency_restart() before the initial pstore operation could
finish, so the logs were never dumped:

  real panic() {
        kmsg_dump() {
                ...
                pstore_dump() {
                        start_dump();
                        ... // long time operation triggers NMI watchdog
                        nmi panic() {
                                ...
                                emergency_restart(); // pstore unfinished
                        }
                        ...
                        finish_dump(); // never reached
                }
        }
  }

Both watchdog_buddy_check_hardlockup() and watchdog_overflow_callback() may
trigger during a panic. This can lead to recursive panic handling.

Add panic_in_progress() checks so watchdog activity is skipped once a panic
has begun.

This prevents recursive panics and makes the panic path more reliable.
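
The guard can be illustrated with a small user-space model. This is only a
sketch, not the kernel implementation: every model_* name below is
hypothetical, and the assumption that a panic is tracked by recording the
panicking CPU in an atomic variable is made for illustration only.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_CPU_INVALID (-1)

/* Hypothetical stand-in: which CPU, if any, is currently panicking. */
static atomic_int model_panic_cpu = MODEL_CPU_INVALID;

/* Counts panic attempts made while another panic was already running. */
static int model_nested_panics;

/* Model of the check: true once any CPU has entered the panic path. */
static bool model_panic_in_progress(void)
{
	return atomic_load(&model_panic_cpu) != MODEL_CPU_INVALID;
}

/* Model of panic(): the first caller claims the slot, later ones nest. */
static void model_panic(int cpu)
{
	int expected = MODEL_CPU_INVALID;

	if (!atomic_compare_exchange_strong(&model_panic_cpu, &expected, cpu))
		model_nested_panics++;
}

/*
 * Model of a watchdog callback with the guard in place.
 * Returns false when the check was skipped because a panic is under
 * way, true when the lockup check actually ran.
 */
static bool model_watchdog_callback(int cpu, bool stalled)
{
	if (model_panic_in_progress())
		return false;	/* do not start a second panic */

	if (stalled)
		model_panic(cpu);

	return true;
}
```

In this model, calling model_panic(0) and then
model_watchdog_callback(1, true) returns false and leaves
model_nested_panics at 0. Without the early return, the second call would
reach model_panic() and count as a nested panic.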

Signed-off-by: Jinchao Wang <wangjinchao...@gmail.com>
Reviewed-by: Yury Norov (NVIDIA) <yury.no...@gmail.com>
---
 kernel/watchdog.c      | 6 ++++++
 kernel/watchdog_perf.c | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 80b56c002c7f..597c0d947c93 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -740,6 +740,12 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
        if (!watchdog_enabled)
                return HRTIMER_NORESTART;
 
+       /*
+        * Skip the buddy check if a panic is in progress
+        */
+       if (panic_in_progress())
+               return HRTIMER_NORESTART;
+
        watchdog_hardlockup_kick();
 
        /* kick the softlockup detector */
diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c
index 9c58f5b4381d..d3ca70e3c256 100644
--- a/kernel/watchdog_perf.c
+++ b/kernel/watchdog_perf.c
@@ -12,6 +12,7 @@
 
 #define pr_fmt(fmt) "NMI watchdog: " fmt
 
+#include <linux/panic.h>
 #include <linux/nmi.h>
 #include <linux/atomic.h>
 #include <linux/module.h>
@@ -108,6 +109,9 @@ static void watchdog_overflow_callback(struct perf_event *event,
        /* Ensure the watchdog never gets throttled */
        event->hw.interrupts = 0;
 
+       if (panic_in_progress())
+               return;
+
        if (!watchdog_check_timestamp())
                return;
 
-- 
2.43.0

