On Mon, Aug 15, 2022 at 06:15:40PM -0700, Jay Vosburgh wrote:
> Corey Minyard <[email protected]> wrote:
> 
> >On Fri, Aug 12, 2022 at 04:33:18PM -0700, Jay Vosburgh wrote:
> >> We have observed issues wherein the IPMI driver will sleep forever in
> >> uninterruptible wait with mutexes held (specifically, dyn_mutex and
> >> bmc_reg_mutex, acquired by __bmc_get_device_id).  This occurs ultimately
> >> due to a BMC firmware bug; the BMC will fail to respond to requests,
> >> apparently related to the request rate, and the current logic will wait
> >> forever.
> >
> >This really isn't the right fix.  The state machines running the
> >interfaces are required to time out after a period of time, usually
> >5 seconds, but that depends on how the hardware is behaving, or
> >misbehaving in this case.  So though these are not timed mutexes, what
> >is running below should be timed, so it shouldn't be necessary here.
> 
>       We did try and follow the code paths to figure out what's going
> on at the lower levels, but the IPMI logic is fairly complex.

Yes.  IPMI defines four different interfaces, and the code makes them
all look pretty much the same.  There are a large number of ways
interfaces can be discovered or added (hard coded, hot plug, pci,
OF, ACPI, SMBIOS, and machine-specific methods) and three different ways
to talk to the hardware (port, memory mapped, and SMBus).  Plus
interrupts or no interrupts, different register spacing and layout.  It
leads to an unfortunate amount of complexity.

The ipmi_msghandler.c code is mostly a message router, not interface
handling.  You can ignore most of that code and look right at the
interface you are using.

> 
>       As best we could determine, the send path for __bmc_get_device_id
> is something like:
> 
> __bmc_get_device_id
>       __get_guid
>               send_guid_cmd
>                       i_ipmi_request
>                               smi_send
>               wait_event
> 
>       where i_ipmi_request() calls i_ipmi_req_sysintf(), because
> send_guid_cmd() sets addr_type == IPMI_SYSTEM_INTERFACE_ADDR_TYPE, and
> then smi_send() to issue the request.
> 
> send_guid_cmd
>       i_ipmi_request
>               i_ipmi_req_sysintf
>               smi_send
> 
>       I presume that if the request goes out via SMI, the response
> should arrive the same way (perhaps this is incorrect).

It should come back from the same place you sent it to, yes.  But
smi_send itself is not specific.

> 
>       It looks like the only place outside of the "make a request"
> path that will tweak the dyn_guid_set (needed to exit the wait_event()
> call) is within guid_handler(), which is the intf->null_user_handler set
> by __get_guid(), which in turn seems to be only called by
> deliver_response().
> 
>       Where are the state machines for the interfaces that should time
> out?  Do they (or should they) call into deliver_response() to break out
> of the wait_event()?

The state machine depends on the low-leel interface.  If it is kcs, bt,
or smic, it uses ipmi_si_intf.c, which then uses one of ipmi_kcs_sm.c,
ipmi_bt_sm.c, or ipmi_smic_sm.c to run the low-level state machine.

There is debugging you can enable in ipmi_smi_intf.c and in each
low-level driver.

If your interface is ssif (SMBus based), then ipmi_ssif.c contains the
code.  You can enabled debugging there, too.

We have had this issue before, and someone suggested the same fix you
did, but we ended up fixing the issue in the state machine.  If you fix
this in the upper level like you have, you will just have the same issue
again if the user software does the same command the BMC does here.
Which is quite likely.

If worse comes to worse, we can add a workaround for your specific
hardware.  Which has also been done.  But fixing the state machine is
really the right thing to do.

-corey

> 
> >What is the particular hardware involved?  The buggy hardware may be
> >exercising a software bug.
> 
>       The systems are white box servers from Ciara, the BMC chipset is
> an AST2500.
> 
>       -J
> 
> >-corey
> >
> >> 
> >> When the problem occurs, as each successive process queries the BMC,
> >> they will block in D state waiting to acquire the mutex, and with time
> >> the machine hangs. The BMC vendor has agreed it may be a hardware fault.
> >> 
> >> This patch addresses the problem by replacing wait_event() with
> >> wait_event_timeout(). If the event times out (meaning the problem has
> >> occurred) the bmc->dyn_id_set and bmc->dyn_guid_set are set to 0 and the
> >> process eventually returns.
> >> 
> >> Fixes: aa9c9ab2443e ("ipmi: allow dynamic BMC version information")
> >> Signed-off-by: Jay Vosburgh <[email protected]>
> >> Signed-off-by: Ioanna Alifieraki <[email protected]>
> >> 
> >> ---
> >>  drivers/char/ipmi/ipmi_msghandler.c | 14 ++++++++++----
> >>  1 file changed, 10 insertions(+), 4 deletions(-)
> >> 
> >> diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
> >> b/drivers/char/ipmi/ipmi_msghandler.c
> >> index 703433493c85..a510839853b5 100644
> >> --- a/drivers/char/ipmi/ipmi_msghandler.c
> >> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> >> @@ -2572,7 +2572,9 @@ static int __get_device_id(struct ipmi_smi *intf, 
> >> struct bmc_device *bmc)
> >>    if (rv)
> >>            goto out_reset_handler;
> >>  
> >> -  wait_event(intf->waitq, bmc->dyn_id_set != 2);
> >> +  rv = wait_event_timeout(intf->waitq, bmc->dyn_id_set != 2, HZ * 5);
> >> +  if (!rv)
> >> +          bmc->dyn_id_set = 0;
> >>  
> >>    if (!bmc->dyn_id_set) {
> >>            if (bmc->cc != IPMI_CC_NO_ERROR &&
> >> @@ -3337,11 +3339,15 @@ static void __get_guid(struct ipmi_smi *intf)
> >>    bmc->dyn_guid_set = 2;
> >>    intf->null_user_handler = guid_handler;
> >>    rv = send_guid_cmd(intf, 0);
> >> -  if (rv)
> >> +  if (rv) {
> >>            /* Send failed, no GUID available. */
> >>            bmc->dyn_guid_set = 0;
> >> -  else
> >> -          wait_event(intf->waitq, bmc->dyn_guid_set != 2);
> >> +  } else {
> >> +          rv = wait_event_timeout(intf->waitq, bmc->dyn_guid_set != 2,
> >> +                                  HZ * 5);
> >> +          if (!rv)
> >> +                  bmc->dyn_guid_set = 0;
> >> +  }
> >>  
> >>    /* dyn_guid_set makes the guid data available. */
> >>    smp_rmb();
> >> -- 
> >> 2.34.1
> >> 
> 
> ---
>       -Jay Vosburgh, [email protected]


_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to