Hi,
 Any comments?

On 2013/8/13 16:17, Yijing Wang wrote:
> Hi all,
>    We recently hit a bug while using the IPMI driver on one of our machines.
> I don't know whether it is caused by the kernel IPMI driver or whether the
> hardware is responsible. Any comments are welcome, thanks!
> 
> On our machine, the IPMI driver always prints messages like the ones below
> after a long run:
> 
> Bad version: Linux kernel 3.0.58 (SLES11 SP2 shows the same problem)
> Good version: Linux kernel 2.6.32
> 
> 1167440 Jul 30 17:01:15 BMC_test kernel: [ 5156.759059] KCS: State = 5, 42
> 1167441 Jul 30 17:01:15 BMC_test kernel: [ 5156.759063] KCS: State = 5, 42
> 1167442 Jul 30 17:01:15 BMC_test kernel: [ 5156.759066] KCS: State = 5, 0
> 1167443 Jul 30 17:01:15 BMC_test kernel: [ 5156.759070] KCS: State = 0, 1
> 1167444 Jul 30 17:01:15 BMC_test kernel: [ 5156.760065] KCS: State = 0, 
> 07.257249] KCS: State = 9, 0
> 1167445 Jul 30 17:01:15 BMC_test kernel: [ 5157.257252] KCS: State = 9, 0
> 1167446 Jul 30 17:01:15 BMC_test kernel: [ 5157.257256] KCS: State = 9, 0
> 1167447 Jul 30 17:01:15 BMC_test kernel: [ 5157.257259] KCS: State = 9, 0
> 1167448 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
> 1167449 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
> 1167450 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
> 1167451 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
> .........................................................................
> 
> We found that once KCS enters state (9, 0), it cannot leave that loop.
> After a while the BMC reboots the OS, because the IPMI driver has not fed
> its watchdog for too long.
> 
> It seems that the kernel keeps waiting for the OBF bit to become 1, but
> GET_STATUS_OBF(status) returns 0. Because time is 0 here, obf_timeout is
> never decremented, so check_obf() always returns 0.
> 
> static inline int check_obf(struct si_sm_data *kcs, unsigned char status,
>                           long time)
> {
>       if (!GET_STATUS_OBF(status)) {
>               kcs->obf_timeout -= time;
>               if (kcs->obf_timeout < 0) {
>                   start_error_recovery(kcs, "OBF not ready in time");
>                   return 1;
>               }
>               return 0;
>       }
>       kcs->obf_timeout = OBF_RETRY_TIMEOUT;
>       return 1;
> }
> 
> So kcs_event() always returns SI_SM_CALL_WITH_DELAY:
>       case KCS_ERROR3:
>               if (state != KCS_IDLE_STATE) {
>                       start_error_recovery(kcs,
>                                            "Not in idle state for error3");
>                       break;
>               }
> 
>               if (!check_obf(kcs, status, time))
>                       return SI_SM_CALL_WITH_DELAY;
> 
> static enum si_sm_result smi_event_handler(struct smi_info *smi_info,
>                                          int time)
> {
>       enum si_sm_result si_sm_result;
> 
>  restart:
>       /*
>        * There used to be a loop here that waited a little while
>        * (around 25us) before giving up.  That turned out to be
>        * pointless, the minimum delays I was seeing were in the 300us
>        * range, which is far too long to wait in an interrupt.  So
>        * we just run until the state machine tells us something
>        * happened or it needs a delay.
>        */
>       si_sm_result = smi_info->handlers->event(smi_info->si_sm, time);
>       time = 0;
>       while (si_sm_result == SI_SM_CALL_WITHOUT_DELAY)
>               /* <-- it looks like we are always in this loop */
>               si_sm_result = smi_info->handlers->event(smi_info->si_sm, 0);
> 
> 
> We found that Matthew Garrett committed several patches that modified the
> related code in smi_timeout(): commit ea4078ca and commit 3326f4f2.
> 
> We tried removing the if check and tested the machine under stress. After
> more than 24 hours of testing the result was OK. Without removing the if
> check, the bug is triggered after about 8 hours of testing.
> 
> do_mod_timer:
>       /* <-- after removing the if line below, the test result seems good,
>        *     at least better than before */
>       if (smi_result != SI_SM_IDLE)
>               mod_timer(&(smi_info->si_timer), timeout);
> 
> So is this the root cause of the issue?
> 
> Also, I don't know whether the kernel needs to provide a mechanism to
> prevent the IPMI driver from entering this endless loop.
> Or is this a hardware problem?
> 
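
To make the check_obf() reasoning above easier to check, here is a minimal
user-space sketch of the same control flow. It is not kernel code: struct
kcs_sim, sim_check_obf(), polls_until_recovery() and the value of
OBF_RETRY_TIMEOUT_US are names and numbers assumed for illustration only, and
the OBF bit is modelled as a plain flag that stays at 0.

```c
#include <assert.h>

/* Assumed budget for illustration; the real constant is defined in the driver. */
#define OBF_RETRY_TIMEOUT_US 1000000L

/* Hypothetical stand-in for the fields of struct si_sm_data used by check_obf(). */
struct kcs_sim {
	long obf_timeout;      /* remaining time budget */
	int  recovery_started; /* set where the driver calls start_error_recovery() */
};

/* Same branches as the quoted check_obf(), with OBF passed in as a flag. */
static int sim_check_obf(struct kcs_sim *kcs, int obf_set, long time)
{
	if (!obf_set) {
		kcs->obf_timeout -= time;
		if (kcs->obf_timeout < 0) {
			kcs->recovery_started = 1;
			return 1;
		}
		return 0;
	}
	kcs->obf_timeout = OBF_RETRY_TIMEOUT_US;
	return 1;
}

/*
 * Poll with OBF stuck at 0, passing time_per_call on every call.
 * Returns the number of calls made when recovery starts, or -1 if
 * recovery never starts within max_calls.
 */
int polls_until_recovery(long time_per_call, int max_calls)
{
	struct kcs_sim kcs = { OBF_RETRY_TIMEOUT_US, 0 };
	int i;

	assert(max_calls > 0);
	for (i = 1; i <= max_calls; i++) {
		sim_check_obf(&kcs, 0, time_per_call);
		if (kcs.recovery_started)
			return i;
	}
	return -1;
}
```

Calling polls_until_recovery(0, n) never reaches the recovery branch for any
n, because the budget is never reduced, while a non-zero time_per_call
eventually does. This only illustrates the arithmetic; it says nothing about
which caller passes time == 0 on the real machine.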


-- 
Thanks!
Yijing


_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
