When the BMC resets while the IPMI watchdog is active, the driver has
three failure modes depending on timing:

1. list_add double add panic -- the watchdog daemon retries while the
   static smi_msg/recv_msg structures are still queued in the IPMI
   layer from the previous (unanswered) request.

2. D-state hang -- wait_for_completion() blocks indefinitely because
   the BMC never delivers a response.

3. Silent loss of watchdog protection -- the BMC returns a non-zero
   completion code, the driver's internal state becomes inconsistent,
   writes to /dev/watchdog return -EINVAL, and the daemon gives up.
   The system continues running without hardware watchdog coverage.

All three stem from the same root cause: the static message structures
and unbounded completion waits were never designed for a BMC that
disappears mid-transaction.
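
To illustrate failure mode 1, here is a minimal userspace sketch of why
re-queuing a node that is still linked gets flagged as a double add. It
is not the driver's code: struct list_node, list_init() and
list_add_checked() are hypothetical names, and the check only mimics
the "new == prev || new == next" test that the kernel's list debugging
applies before linking.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* An empty circular list points at itself. */
static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Insert new_node right after head. Returns false, leaving the list
 * untouched, when new_node is already adjacent to the insertion point,
 * which is what happens when a static message is queued a second time
 * before the first request was ever dequeued.
 */
static bool list_add_checked(struct list_node *new_node,
			     struct list_node *head)
{
	struct list_node *prev = head;
	struct list_node *next = head->next;

	if (new_node == prev || new_node == next)
		return false;	/* double add detected */

	next->prev = new_node;
	new_node->next = next;
	new_node->prev = prev;
	prev->next = new_node;
	return true;
}
```

With a single static node, the first list_add_checked() call succeeds
and an immediate second call on the same node is rejected, because the
node is still the first entry of the queue.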

This has been independently reported by Kenta Akagi on a Dell PowerEdge
R640 running 6.18.7, also triggered by a BMC reset with the watchdog
active:

  https://sourceforge.net/p/openipmi/mailman/message/59292850/

The fix takes a simple, deterministic approach: detect the failure from
the BMC's error response, guard against structure reuse (msg_in_flight)
and indefinite waits (completion timeout), then call orderly_reboot()
if the watchdog is active. This produces the same outcome the hardware
watchdog would have -- a system reset -- but through a controlled path
with clear logging and no panics or hangs.

If the watchdog is stopped when the BMC resets, no reboot occurs and
the system continues normally.
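
The control flow described above can be sketched in userspace C as
follows. This is only an illustration under stated assumptions, not the
patch itself: the name msg_in_flight is taken from this cover letter,
but wdt_claim_msg(), wdt_release_msg(), wdt_on_result() and the
wdt_action enum are hypothetical, and treating a timed-out wait the
same way as a non-zero completion code is an assumption of the sketch.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical guard for the driver's static message structures. */
static atomic_bool msg_in_flight;

/* Claim the static message; -EBUSY while a request is outstanding. */
static int wdt_claim_msg(void)
{
	bool expected = false;

	if (!atomic_compare_exchange_strong(&msg_in_flight, &expected, true))
		return -EBUSY;
	return 0;
}

/* Called once the lower layer has handed the message back. */
static void wdt_release_msg(void)
{
	atomic_store(&msg_in_flight, false);
}

enum wdt_action { WDT_CONTINUE, WDT_REBOOT };

/*
 * Decide what to do after a request finishes or the bounded wait
 * expires. A failure only leads to a reboot while the watchdog is
 * active; with the watchdog stopped the system carries on.
 */
static enum wdt_action wdt_on_result(bool timed_out, int completion_code,
				     bool watchdog_active)
{
	bool failed = timed_out || completion_code != 0;

	return (failed && watchdog_active) ? WDT_REBOOT : WDT_CONTINUE;
}
```

In a real driver, the WDT_REBOOT branch would correspond to the
orderly_reboot() call, and the bounded wait would presumably use
something like wait_for_completion_timeout() in place of
wait_for_completion().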

Tested on a Dell PowerEdge R640 with kernel 5.14 (RHEL 9) and verified
against mainline (both patches apply cleanly).

Corey Minyard's recent fix for list corruption in smi_work()
(ipmi_msghandler.c) addresses a related but separate code path. The
watchdog driver's reuse of its own static structures still requires
this fix.

Tony Camuso (2):
  ipmi:watchdog: Reboot cleanly on BMC reset
  Documentation: ipmi: Update BMC reset behavior for watchdog

 Documentation/driver-api/ipmi.rst |  61 ++++++++++++++++++
 drivers/char/ipmi/ipmi_watchdog.c | 101 ++++++++++++++++++++++++------
 2 files changed, 144 insertions(+), 18 deletions(-)

--
2.53.0