On Sun, Sep 13, 2020 at 02:10:01PM +0000, Tianxianting wrote: > Hi Corey > Thanks for your quickly reply, > We didn't try the method you mentioned, actually, I didn't know it before you > told me:( > The issue ever occurred on our 2 ceph storage server both with low > probability. > We finally use this patch to solve the issue, it can automatically solve the > issue when it happened. So no need to judge and reload ipmi driver manually > or develop additional scripts to make it. > The 1 second delay is acceptable to us. > > If there really isn't a BMC out there, ipmi driver will not be loaded, is it > right?
No, there are systems that have IPMI controllers that are specified in the ACPI/SMBIOS tables but have no IPMI controller. > > May be we can adjust to retry 3 times with 500ms interval? Maybe there is another way. What I'm guessing is happening is not that the interface is lossy, but that the BMC is in the middle of a reset or an upgrade. The D5 completion code means: Cannot execute command. Command, or request parameter(s), not supported in present state. That error is also returned from bt_start_transaction() in the driver if the driver is not in the right state when starting a transaction, which may point to a bug in the BT state machine. Search for IPMI_NOT_IN_MY_STATE_ERR in drivers/char/ipmi/ipmi_bt_sm.c. If it's coming fron the state machine, it would be handy to know what state it was in when the error happened to help trace the bug down. That's what I would suggest first. Fix the fundamental bug, if you can. a pr_warn() added to the code there would be all that's needed, and thats probably a good permanent addition. I would be ok with a patch that retried some number of times if it got a D5 completion code. That wouldn't have any effect on other systems. Plus you could add a D1 and D2 completion code, which are similar things. Add the new completion codes to include/uapi/linux/ipmi_msgdefs.h and implement the retry. That makes sense from a general point of view. Thanks, -corey > > Thanks in advance if you can feedback again. > > -----Original Message----- > From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey Minyard > Sent: Sunday, September 13, 2020 8:40 PM > To: tianxianting (RD) <tian.xiant...@h3c.com> > Cc: a...@arndb.de; gre...@linuxfoundation.org; > openipmi-developer@lists.sourceforge.net; linux-ker...@vger.kernel.org > Subject: Re: [PATCH] ipmi: retry to get device id when error > > On Sun, Sep 13, 2020 at 08:02:03PM +0800, Xianting Tian wrote: > > We can't get bmc's device id with low probability when loading ipmi > > driver, it caused bmc device register failed. This issue may caused by > > bad lpc signal quality. When this issue happened, we got below kernel > > printks: > > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: > > device id demangle failed: -22 > > [Wed Sep 9 19:52:03 2020] IPMI BT: using default values > > [Wed Sep 9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2 > > [Wed Sep 9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device > > id: -5 > > [Wed Sep 9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register > > device: error -5 > > > > When this issue happened, we want to manually unload the driver and > > try to load it again, but it can't be unloaded by 'rmmod' as it is already > > 'in use'. > > I'm not sure this patch is a good idea; it would cause a long boot delay in > situations where there really isn't a BMC out there. Yes, it happens. > > You don't have to reload the driver to add a device, though. You can hot-add > devices using /sys/modules/ipmi_si/parameters/hotmod. Look in > Documentation/driver-api/ipmi.rst for details. > > Does that work for you? > > -corey > > > > > We add below 'printk' in handle_one_recv_msg(), when this issue > > happened, the msg we received is "Recv: 1c 01 d5", which means the > > data_len is 1, data[0] is 0xd5. > > Debug code: > > static int handle_one_recv_msg(struct ipmi_smi *intf, > > struct ipmi_smi_msg *msg) { > > printk("Recv: %*ph\n", msg->rsp_size, msg->rsp); > > ... ... > > } > > Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7' > > and 'data[0] != 0'. > > > > We used this patch to retry to get device id when error happen, we > > reproduced this issue again and the retry succeed on the first retry, > > we finally got the correct msg and then all is ok: > > Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00 > > > > So use retry machanism in this patch to give bmc more opportunity to > > correctly response kernel. > > > > Signed-off-by: Xianting Tian <tian.xiant...@h3c.com> > > --- > > drivers/char/ipmi/ipmi_msghandler.c | 17 ++++++++++++++--- > > 1 file changed, 14 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/char/ipmi/ipmi_msghandler.c > > b/drivers/char/ipmi/ipmi_msghandler.c > > index 737c0b6b2..bfb2de77a 100644 > > --- a/drivers/char/ipmi/ipmi_msghandler.c > > +++ b/drivers/char/ipmi/ipmi_msghandler.c > > @@ -34,6 +34,7 @@ > > #include <linux/uuid.h> > > #include <linux/nospec.h> > > #include <linux/vmalloc.h> > > +#include <linux/delay.h> > > > > #define IPMI_DRIVER_VERSION "39.2" > > > > @@ -60,6 +61,9 @@ enum ipmi_panic_event_op { #else #define > > IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE #endif > > + > > +#define GET_DEVICE_ID_MAX_RETRY 5 > > + > > static enum ipmi_panic_event_op ipmi_send_panic_event = > > IPMI_PANIC_DEFAULT; > > > > static int panic_op_write_handler(const char *val, @@ -2426,19 > > +2430,26 @@ send_get_device_id_cmd(struct ipmi_smi *intf) static int > > __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc) { > > int rv; > > - > > - bmc->dyn_id_set = 2; > > + unsigned int retry_count = 0; > > > > intf->null_user_handler = bmc_device_id_handler; > > > > +retry: > > + bmc->dyn_id_set = 2; > > + > > rv = send_get_device_id_cmd(intf); > > if (rv) > > return rv; > > > > wait_event(intf->waitq, bmc->dyn_id_set != 2); > > > > - if (!bmc->dyn_id_set) > > + if (!bmc->dyn_id_set) { > > + msleep(1000); > > + if (++retry_count <= GET_DEVICE_ID_MAX_RETRY) > > + goto retry; > > + > > rv = -EIO; /* Something went wrong in the fetch. */ > > + } > > > > /* dyn_id_set makes the id data available. */ > > smp_rmb(); > > -- > > 2.17.1 > > _______________________________________________ Openipmi-developer mailing list Openipmi-developer@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/openipmi-developer