Corey Minyard <[email protected]> wrote:
>On Fri, Aug 12, 2022 at 04:33:18PM -0700, Jay Vosburgh wrote:
>> We have observed issues wherein the IPMI driver will sleep forever in
>> uninterruptible wait with mutexes held (specifically, dyn_mutex and
>> bmc_reg_mutex, acquired by __bmc_get_device_id). This occurs ultimately
>> due to a BMC firmware bug; the BMC will fail to respond to requests,
>> apparently related to the request rate, and the current logic will wait
>> forever.
>
>This really isn't the right fix. The state machines running the
>interfaces are required to time out after a period of time, usually
>5 seconds, but that depends on how the hardware is behaving, or
>misbehaving in this case. So though these are not timed mutexes, what
>is running below should be timed, so it shouldn't be necessary here.
We did try and follow the code paths to figure out what's going
on at the lower levels, but the IPMI logic is fairly complex.
As best we could determine, the send path for __bmc_get_device_id
is something like:
__bmc_get_device_id
__get_guid
send_guid_cmd
i_ipmi_request
smi_send
wait_event
where i_ipmi_request() calls i_ipmi_req_sysintf(), because
send_guid_cmd() sets addr_type == IPMI_SYSTEM_INTERFACE_ADDR_TYPE, and
then smi_send() to issue the request.
send_guid_cmd
i_ipmi_request
i_ipmi_req_sysintf
smi_send
I presume that if the request goes out via SMI, the response
should arrive the same way (perhaps this is incorrect).
It looks like the only place outside of the "make a request"
path that will tweak the dyn_guid_set (needed to exit the wait_event()
call) is within guid_handler(), which is the intf->null_user_handler set
by __get_guid(), which in turn seems to be only called by
deliver_response().
Where are the state machines for the interfaces that should time
out? Do they (or should they) call into deliver_response() to break out
of the wait_event()?
>What is the particular hardware involved? The buggy hardware may be
>exercising a software bug.
The systems are white box servers from Ciara, the BMC chipset is
an AST2500.
-J
>-corey
>
>>
>> When the problem occurs, as each successive process queries the BMC,
>> they will block in D state waiting to acquire the mutex, and with time
>> the machine hangs. The BMC vendor has agreed it may be a hardware fault.
>>
>> This patch addresses the problem by replacing wait_event() with
>> wait_event_timeout(). If the event times out (meaning the problem has
>> occurred) the bmc->dyn_id_set and bmc->dyn_guid_set are set to 0 and the
>> process eventually returns.
>>
>> Fixes: aa9c9ab2443e ("ipmi: allow dynamic BMC version information")
>> Signed-off-by: Jay Vosburgh <[email protected]>
>> Signed-off-by: Ioanna Alifieraki <[email protected]>
>>
>> ---
>> drivers/char/ipmi/ipmi_msghandler.c | 14 ++++++++++----
>> 1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/char/ipmi/ipmi_msghandler.c
>> b/drivers/char/ipmi/ipmi_msghandler.c
>> index 703433493c85..a510839853b5 100644
>> --- a/drivers/char/ipmi/ipmi_msghandler.c
>> +++ b/drivers/char/ipmi/ipmi_msghandler.c
>> @@ -2572,7 +2572,9 @@ static int __get_device_id(struct ipmi_smi *intf,
>> struct bmc_device *bmc)
>> if (rv)
>> goto out_reset_handler;
>>
>> - wait_event(intf->waitq, bmc->dyn_id_set != 2);
>> + rv = wait_event_timeout(intf->waitq, bmc->dyn_id_set != 2, HZ * 5);
>> + if (!rv)
>> + bmc->dyn_id_set = 0;
>>
>> if (!bmc->dyn_id_set) {
>> if (bmc->cc != IPMI_CC_NO_ERROR &&
>> @@ -3337,11 +3339,15 @@ static void __get_guid(struct ipmi_smi *intf)
>> bmc->dyn_guid_set = 2;
>> intf->null_user_handler = guid_handler;
>> rv = send_guid_cmd(intf, 0);
>> - if (rv)
>> + if (rv) {
>> /* Send failed, no GUID available. */
>> bmc->dyn_guid_set = 0;
>> - else
>> - wait_event(intf->waitq, bmc->dyn_guid_set != 2);
>> + } else {
>> + rv = wait_event_timeout(intf->waitq, bmc->dyn_guid_set != 2,
>> + HZ * 5);
>> + if (!rv)
>> + bmc->dyn_guid_set = 0;
>> + }
>>
>> /* dyn_guid_set makes the guid data available. */
>> smp_rmb();
>> --
>> 2.34.1
>>
---
-Jay Vosburgh, [email protected]
_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer