Dear IPMI supporters, We observe a window in IPMI BT's opportunistic get capabilities request, wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in bt_start_transaction, IPMI fails to initialize, and the interface is torn down. There is no mechanism to retry bringing up the interface in open() /dev/ipmi. This leaves IPMI hosed until you reload modules. Looks to happen after we call schedule().
The problem is intermittent, but we observe ~50% fail rate on 4.18 kernels on our UV4 systems. Here's the relevant snippet (the action starts at 120.367379): [ 118.316764] IPMI System Interface driver. [ 118.354246] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS [ 118.354249] ipmi_si: SMBIOS: mem 0xce4 regsize 1 spacing 1 irq 6 [ 118.354251] ipmi_si: Adding SMBIOS-specified bt state machine [ 118.354311] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI [ 118.381164] ipmi_si IPI0001:00: [io 0x0ce4-0x0ce6] regsize 1 spacing 1 irq 6 [ 118.381167] ipmi_si: Adding ACPI-specified bt state machine [ 118.382811] ipmi_si: Trying SMBIOS-specified bt state machine at mem address 0xce4, slave address 0x20, irq 6 [ 118.382815] ipmi_si dmi-ipmi-si.0: Could not set up I/O space [ 118.382816] ipmi_si: Trying ACPI-specified bt state machine at i/o address 0xce4, slave address 0x0, irq 6 [ 118.382911] IPMI BT: flag reset [ ] [ 118.383008] BT: XACTION [ ] TO=5000000 - 0 [ 118.383059] BT: WR_BYTES [ ] TO=5000000 - 0 [ 118.383272] BT: WR_CONSUME [ H2B ] TO=5000000 - 0 [ 118.388034] BT: WR_CONSUME [ ] TO=5000000 - 4000 [ 118.388059] BT: RD_WAIT [ ] TO=4996000 - 0 [ 118.396017] BT: RD_WAIT [ ] TO=4996000 - 4000 [ 118.492015] BT: RD_WAIT [ ] TO=4992000 - 4000 [ 118.520073] BT: RD_WAIT [ ] TO=4988000 - 4000 [ 118.528079] BT: RD_WAIT [ ] TO=4984000 - 4000 [ 118.540049] BT: RD_WAIT [ ] TO=4980000 - 4000 [ 118.548011] BT: RD_WAIT [ ] TO=4976000 - 4000 [ 118.564033] BT: RD_WAIT [ ] TO=4972000 - 4000 [ 118.572109] BT: RD_WAIT [ ] TO=4968000 - 4000 [ 118.580013] BT: RD_WAIT [ ] TO=4964000 - 4000 [ 118.588017] BT: RD_WAIT [ ] TO=4960000 - 4000 [ 118.608014] BT: RD_WAIT [ ] TO=4956000 - 4000 [ 118.628058] BT: RD_WAIT [ ] TO=4952000 - 4000 [ 118.636053] BT: RD_WAIT [ ] TO=4948000 - 4000 [ 118.648016] BT: RD_WAIT [ ] TO=4944000 - 4000 [ 118.672028] BT: RD_WAIT [ ] TO=4940000 - 4000 [ 118.684056] BT: RD_WAIT [ ] TO=4936000 - 4000 [ 118.696021] BT: RD_WAIT [ ] TO=4932000 - 4000 [ 118.708015] BT: RD_WAIT [ ] TO=4928000 - 4000 [ 118.720018] BT: RD_WAIT [ ] TO=4924000 - 4000 [ 118.732063] BT: RD_WAIT [ ] TO=4920000 - 4000 [ 118.744021] BT: RD_WAIT [ ] TO=4916000 - 4000 [ 118.752089] BT: RD_WAIT [ ] TO=4912000 - 4000 [ 118.764011] BT: RD_WAIT [ ] TO=4908000 - 4000 [ 118.784009] BT: RD_WAIT [ ] TO=4904000 - 4000 [ 118.808011] BT: RD_WAIT [ ] TO=4900000 - 4000 [ 118.832010] BT: RD_WAIT [ ] TO=4896000 - 4000 [ 118.844011] BT: RD_WAIT [ ] TO=4892000 - 4000 [ 118.860036] BT: RD_WAIT [ ] TO=4888000 - 4000 [ 118.868102] BT: RD_WAIT [ ] TO=4884000 - 4000 [ 118.876012] BT: RD_WAIT [ ] TO=4880000 - 4000 [ 118.904011] BT: RD_WAIT [ ] TO=4876000 - 4000 [ 118.924013] BT: RD_WAIT [ ] TO=4872000 - 4000 [ 118.948016] BT: RD_WAIT [ ] TO=4868000 - 4000 [ 118.984043] BT: RD_WAIT [ ] TO=4864000 - 4000 [ 118.992059] BT: RD_WAIT [ ] TO=4860000 - 4000 [ 118.995422] for 2000000 [ 118.995650] [ 119.000026] BT: RD_WAIT [ ] TO=4856000 - 4000 [ 119.012018] BT: RD_WAIT [ ] TO=4852000 - 4000 [ 119.032010] BT: RD_WAIT [ ] TO=4848000 - 4000 [ 119.048013] BT: RD_WAIT [ ] TO=4844000 - 4000 [ 119.060013] BT: RD_WAIT [ ] TO=4840000 - 4000 [ 119.080014] BT: RD_WAIT [ ] TO=4836000 - 4000 [ 119.096021] BT: RD_WAIT [ ] TO=4832000 - 4000 [ 119.120009] BT: RD_WAIT [ ] TO=4828000 - 4000 [ 119.128017] BT: RD_WAIT [ ] TO=4824000 - 4000 [ 119.136011] BT: RD_WAIT [ ] TO=4820000 - 4000 [ 119.144013] BT: RD_WAIT [ ] TO=4816000 - 4000 [ 119.164008] BT: RD_WAIT [ ] TO=4812000 - 4000 [ 119.184010] BT: RD_WAIT [ ] TO=4808000 - 4000 [ 119.204015] BT: RD_WAIT [ ] TO=4804000 - 4000 [ 119.264034] BT: RD_WAIT [ ] TO=4800000 - 4000 [ 119.272013] BT: RD_WAIT [ ] TO=4796000 - 4000 [ 119.288013] BT: RD_WAIT [ ] TO=4792000 - 4000 [ 119.308011] BT: RD_WAIT [ ] TO=4788000 - 4000 [ 119.324011] BT: RD_WAIT [ ] TO=4784000 - 4000 [ 119.344010] BT: RD_WAIT [ ] TO=4780000 - 4000 [ 119.364027] BT: RD_WAIT [ ] TO=4776000 - 4000 [ 119.376011] BT: RD_WAIT [ ] TO=4772000 - 4000 [ 119.396010] BT: RD_WAIT [ ] TO=4768000 - 4000 [ 119.408013] BT: RD_WAIT [ ] TO=4764000 - 4000 [ 119.424012] BT: RD_WAIT [ B2H ] TO=4760000 - 4000 [ 119.434845] BT: CLEAR_B2H [ H_BUSY ] TO=4756000 - 0 [ 119.592204] BT: RD_BYTES [ H_BUSY ] TO=4756000 - 0 [ 119.650481] BT: XACTION [ ] TO=5000000 - 0 [ 119.676062] BT: WR_BYTES [ ] TO=5000000 - 0 [ 119.734596] BT: WR_CONSUME [ H2B ] TO=5000000 - 0 [ 119.740019] BT: WR_CONSUME [ ] TO=5000000 - 4000 [ 119.740036] BT: RD_WAIT [ ] TO=4996000 - 0 [ 119.748032] BT: RD_WAIT [ B2H ] TO=4996000 - 4000 [ 119.748101] BT: CLEAR_B2H [ H_BUSY ] TO=4992000 - 0 [ 119.748124] BT: RD_BYTES [ H_BUSY ] TO=4992000 - 0 [ 119.748348] BT: XACTION [ ] TO=5000000 - 0 [ 119.748382] BT: WR_BYTES [ ] TO=5000000 - 0 [ 119.748537] BT: WR_CONSUME [ H2B ] TO=5000000 - 0 [ 119.760013] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000 [ 119.806271] BT: RD_WAIT [ B2H ] TO=4996000 - 0 [ 119.806282] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0 [ 119.816909] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0 [ 119.816944] BT: XACTION [ ] TO=5000000 - 0 [ 119.827061] BT: WR_BYTES [ ] TO=5000000 - 0 [ 119.827088] BT: WR_CONSUME [ H2B ] TO=5000000 - 0 [ 119.852024] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000 [ 119.875752] BT: RD_WAIT [ B2H ] TO=4996000 - 0 [ 119.895099] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0 [ 119.991768] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0 [ 119.991805] BT: XACTION [ ] TO=5000000 - 0 [ 120.004993] BT: WR_BYTES [ ] TO=5000000 - 0 [ 120.027171] BT: WR_CONSUME [ H2B ] TO=5000000 - 0 [ 120.048024] BT: WR_CONSUME [ ] TO=5000000 - 4000 [ 120.072949] BT: RD_WAIT [ B2H ] TO=4996000 - 0 [ 120.085892] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0 [ 120.097374] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0 [ 120.097406] BT: XACTION [ ] TO=5000000 - 0 [ 120.125568] BT: WR_BYTES [ ] TO=5000000 - 0 [ 120.141827] BT: WR_CONSUME [ H2B ] TO=5000000 - 0 [ 120.264026] BT: WR_CONSUME [ B2H ] TO=5000000 - 4000 [ 120.287267] BT: RD_WAIT [ B2H ] TO=4996000 - 0 [ 120.287276] BT: CLEAR_B2H [ H_BUSY ] TO=4996000 - 0 [ 120.287280] BT: RD_BYTES [ H_BUSY ] TO=4996000 - 0 [ 120.287429] ipmi_si IPI0001:00: Using irq 6 [ 120.287447] **Interrupt: 1534434704.287445059 [ 120.287457] BT: XACTION [ ] TO=5000000 - 0 [ 120.287466] BT: WR_BYTES [ ] TO=5000000 - 0 [ 120.287500] BT: WR_CONSUME [ H2B ] TO=5000000 - 0 [ 120.297445] **Interrupt: 1534434704.297445691 [ 120.332534] BT: WR_CONSUME [ B2H ] TO=5000000 - 0 [ 120.332538] BT: RD_WAIT [ B2H ] TO=5000000 - 0 [ 120.342956] BT: CLEAR_B2H [ H_BUSY ] TO=5000000 - 0 [ 120.342969] BT: RD_BYTES [ H_BUSY ] TO=5000000 - 0 [ 120.343244] **Done: 1534434704.343244675 [ 120.367379] BT: IDLE [ ] TO=5000000 - 0 [ 120.367384] BT: CAP_BEGIN [ ] TO=5000000 - 0 [ 120.376634] **Timer: 1534434704.376633453 [ 120.376646] BT: XACTION [ ] TO=5000000 - 92000 [ 120.376683] BT: WR_BYTES [ ] TO=4908000 - 0 [ 120.376995] BT: WR_CONSUME [ H2B ] TO=4908000 - 0 [ 120.377025] Send: 18 08 [ 120.377027] **Enqueue: 1534434704.377028726 [ 120.377028] **Start2: 1534434704.377029859 [ 120.377061] BT: WR_CONSUME [ ] TO=4908000 - 0 [ 120.377137] BT: RD_WAIT [ ] TO=4908000 - 0 [ 120.377140] Recv:: 1c 08 d5 [ 120.377156] Send: 18 01 [ 120.377161] **Enqueue: 1534434704.377161047 [ 120.377164] **Start2: 1534434704.377164751 [ 120.377226] BT: RD_WAIT [ ] TO=4908000 - 0 [ 120.377237] Recv:: 1c 01 d5 [ 120.377242] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22 [ 120.377252] ipmi_si IPI0001:00: Unable to get the device id: -5 [ 120.377252] ipmi_si IPI0001:00: Unable to register device: error -5 [ 120.569865] IPMI SSIF Interface driver Also FYI, Bad cleanup code causes a null pointer and adds a race condition; this is addressed by "[PATCH] Remove redundant cleanup in ipmi_register_smi" from Justin Ernst <justin.er...@hpe.com>, which I paste here: diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c index 51832b8a2c62..3b0b50c4f064 100644 --- a/drivers/char/ipmi/ipmi_msghandler.c +++ b/drivers/char/ipmi/ipmi_msghandler.c @@ -3395,12 +3395,12 @@ int ipmi_register_smi(const struct ipmi_smi_handlers *handlers, out: if (rv) { - ipmi_bmc_unregister(intf); - list_del_rcu(&intf->link); + /* + * ipmi_unregister_smi must be called to clean up after + * failure. We unlock the mutex to allow ipmi_unregister_smi + * to lock it and perform cleanup. + */ mutex_unlock(&ipmi_interfaces_mutex); - synchronize_srcu(&ipmi_interfaces_srcu); - cleanup_srcu_struct(&intf->users_srcu); - kref_put(&intf->refcount, intf_free); } else { /* * Keep memory order straight for RCU readers. Make -- ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Openipmi-developer mailing list Openipmi-developer@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/openipmi-developer