If the BMC is in a state where it is partially responding but not really there, the driver could go into an infinite loop trying error recovery over and over.
The device should eventually come back, but we don't want to be
continually retrying. Add a delay between retries.

Signed-off-by: Corey Minyard <co...@minyard.net>
---
 drivers/char/ipmi/ipmi_kcs_sm.c  | 4 ++--
 drivers/char/ipmi/ipmi_si_intf.c | 9 +++++++--
 2 files changed, 9 insertions(+), 4 deletions(-)

Thanks for the bug report and debugging info. I think I know what is
going on, I've attached a patch that should hopefully fix it.

Basically, it looks like the BMC is alive enough that it sort of
responds to the host, but not alive enough to actually complete a
transaction. The driver needs to not immediately retry in that case,
it needs to delay a bit.

It passes all my tests, but the situation you are in would be hard to
manufacture for me. Can you try this patch?

-corey

diff --git a/drivers/char/ipmi/ipmi_kcs_sm.c b/drivers/char/ipmi/ipmi_kcs_sm.c
index ecfcb50302f6..20f3611c5444 100644
--- a/drivers/char/ipmi/ipmi_kcs_sm.c
+++ b/drivers/char/ipmi/ipmi_kcs_sm.c
@@ -467,7 +467,7 @@ static enum si_sm_result kcs_event(struct si_sm_data *kcs, long time)
 		if (state != KCS_READ_STATE) {
 			start_error_recovery(kcs,
 				"Not in read state for error2");
-			break;
+			return SI_SM_CALL_WITH_TICK_DELAY;
 		}
 		if (!check_obf(kcs, status, time))
 			return SI_SM_CALL_WITH_DELAY;
@@ -481,7 +481,7 @@ static enum si_sm_result kcs_event(struct si_sm_data *kcs, long time)
 		if (state != KCS_IDLE_STATE) {
 			start_error_recovery(kcs,
 				"Not in idle state for error3");
-			break;
+			return SI_SM_CALL_WITH_TICK_DELAY;
 		}

 		if (!check_obf(kcs, status, time))
diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 8b5524069c15..3f4747ae5ddb 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -790,7 +790,10 @@ static enum si_sm_result smi_event_handler(struct smi_info *smi_info,
 			 */
 			return_hosed_msg(smi_info, IPMI_ERR_UNSPECIFIED);
 		}
-		goto restart;
+		/*
+		 * If the device isn't working, we want a delay before
+		 * trying again.
+		 */
 	}

 	/*
@@ -888,15 +891,17 @@ static void flush_messages(void *send_info)
 {
 	struct smi_info *smi_info = send_info;
 	enum si_sm_result result;
+	int loops_left = 10000; /* Don't try forever. */

 	/*
 	 * Currently, this function is called only in run-to-completion
 	 * mode. This means we are single-threaded, no need for locks.
 	 */
 	result = smi_event_handler(smi_info, 0);
-	while (result != SI_SM_IDLE) {
+	while (result != SI_SM_IDLE && loops_left > 0) {
 		udelay(SI_SHORT_TIMEOUT_USEC);
 		result = smi_event_handler(smi_info, SI_SHORT_TIMEOUT_USEC);
+		loops_left--;
 	}
 }

--
2.43.0

_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer