On 4/7/2026 5:54 PM, Corey Minyard wrote:
On Tue, Apr 07, 2026 at 01:51:32PM -0400, Tony Camuso wrote:
When the BMC resets while the IPMI watchdog is active, the driver has
three failure modes depending on timing:
1. list_add double add panic -- the watchdog daemon retries while the
static smi_msg/recv_msg structures are still queued in the IPMI
layer from the previous (unanswered) request.
I'm trying to make sense of this. Are you sure this didn't start
happening after you added a timeout on the wait_for_completion()?
Otherwise it would never return, the mutex would be held, and no new
message could be added.
Just timing out in wait_for_completion() there could cause all kinds of
bad things to happen.
You're right. This work was done on a RHEL 9 kernel that did not yet have
your recent upstream KCS/SI fixes applied, so some of the behavior I
observed may have been caused/influenced by bugs you've more recently
addressed.
2. D-state hang -- wait_for_completion() blocks indefinitely because
the BMC never delivers a response.
This is an issue. The lower level driver is *always* supposed to return
a failure. Something else needs to be fixed.
I have seen several creative ways in which BMCs "fail to respond" that
have confused the lower level drivers. If my guess is correct, there's
a bug in the low level driver that's causing it to not time out the
message.
If we don't fix this, it will cause other issues outside the watchdog.
Agreed -- the D-state hang is a symptom, not the root cause. If the
KCS driver correctly transitions through error recovery to
SI_SM_HOSED, and the SI layer returns an error completion to the
caller, then wait_for_completion() should never block indefinitely.
To get to the bottom of this, I've instrumented three layers:
- ipmi_kcs_sm.c: trace entry into start_error_recovery() and the
transition to KCS_HOSED after MAX_ERROR_RETRIES
- ipmi_si_intf.c: trace return_hosed_msg(), the SI_SM_HOSED
handler in smi_event_handler(), and HOSED recovery in
smi_timeout()
- ipmi_watchdog.c: trace message send/completion in
_ipmi_set_timeout() and __ipmi_heartbeat(), and the completion
code received in ipmi_wdog_msg_handler()
I've applied your recent upstream patches to my test kernel, so the
KCS/SI code is congruent with current mainline. The traces will show
whether the error recovery chain works correctly with your fixes in
place, or whether the BMC is doing something that still confuses the
low-level driver.
I'll collect the data and follow up.
3. Silent loss of watchdog protection -- the BMC returns a non-zero
completion code, the driver's internal state becomes inconsistent,
writes to /dev/watchdog return -EINVAL, and the daemon gives up.
The system continues running without hardware watchdog coverage.
Again, are you sure this didn't start happening after you added the
timeout?
I think this one is pre-existing, independent of any timeout
changes. When the BMC comes back after a reset and returns a
non-zero completion code (e.g. 0xD5 or 0xFF), the watchdog handler
treats this as a permanent failure. The userspace daemon sees
-EINVAL on subsequent writes to /dev/watchdog and stops retrying.
The system continues running without hardware watchdog coverage,
with no indication to the administrator.
But I need to confirm this with the instrumented traces on the
patched kernel.
I should have traces collected within the next week or so.
Tony
All three stem from the same root cause: the static message structures
and unbounded completion waits were never designed for a BMC that
disappears mid-transaction.
All that is supposed to be protected by a mutex. That mutex is claimed
on all IPMI watchdog operations, and it shouldn't be released until all
resources have been freed. Anything that violates that is asking for
trouble.
You don't mention the lower level interface (KCS, BT, SMIC, SSIF) but I
think we need to start looking there.
It may be that the timeouts on the watchdog messages need to be
adjusted. The whole IPMI driver was designed on the presumption that
the BMC would go away for only a short period of time (5-10 seconds),
not permanently. That has slowly been fixed over time, but things might
need to be adjusted in the watchdog.
-corey
This has been independently reported by Kenta Akagi on a Dell PowerEdge
R640 running 6.18.7, also triggered by a BMC reset with the watchdog
active:
https://sourceforge.net/p/openipmi/mailman/message/59292850/
The fix takes a simple, deterministic approach: detect the failure via
BMC error response, guard against structure reuse (msg_in_flight) and
indefinite waits (completion timeout), then initiate orderly_reboot()
when the watchdog is active. This produces the same outcome the
hardware watchdog would have -- a system reset -- but through a
controlled path with clear logging and no panics or hangs.
If the watchdog is stopped when the BMC resets, no reboot occurs and
the system continues normally.
Tested on Dell PowerEdge R640 with kernel 5.14 (RHEL 9) and verified
against mainline (both patches apply cleanly).
Corey Minyard's recent fix for list corruption in smi_work()
(ipmi_msghandler.c) addresses a related but separate code path; the
watchdog driver's own static-structure reuse still requires this fix.
Tony Camuso (2):
ipmi:watchdog: Reboot cleanly on BMC reset
Documentation: ipmi: Update BMC reset behavior for watchdog
Documentation/driver-api/ipmi.rst | 61 ++++++++++++++++++
drivers/char/ipmi/ipmi_watchdog.c | 101 ++++++++++++++++++++++++------
2 files changed, 144 insertions(+), 18 deletions(-)
--
2.53.0
_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer