Bela Lubkin wrote:
>   
>
> Corey (or anyone), do you have a tool which parses these logs
> and translates the unexpected netfn messages to a human
> readable representation like the "Get SDR" you show above?  I
> can do it manually (or write a script), but don't want to
> duplicate effort if it already exists.
>   
Well, no, not really.  Those logs are not supposed to come out, so IMHO
there's not much value in decoding them.  They are really more a
"something is going wrong, here's hopefully enough information for you
to fix it" log.
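
If someone does want to script the translation, a minimal sketch in
Python might look like the following.  The tables hold only a few
NetFn/command pairs from the IPMI specification, and describe() is a
hypothetical helper name.  Pulling the netfn and cmd values out of the
actual log lines is left out, since the log format isn't shown here.

```python
# Partial lookup tables: a small subset of the NetFn/command
# assignments in the IPMI specification, keyed by the even
# (request) netfn value.
NETFN_NAMES = {
    0x00: "Chassis",
    0x04: "Sensor/Event",
    0x06: "App",
    0x0A: "Storage",
    0x0C: "Transport",
}

COMMAND_NAMES = {
    (0x06, 0x01): "Get Device ID",
    (0x04, 0x2D): "Get Sensor Reading",
    (0x0A, 0x20): "Get SDR Repository Info",
    (0x0A, 0x22): "Reserve SDR Repository",
    (0x0A, 0x23): "Get SDR",
    (0x0A, 0x43): "Get SEL Entry",
}


def describe(netfn, cmd):
    """Return a readable label for a netfn/cmd pair (hypothetical helper)."""
    request_netfn = netfn & ~1  # an odd netfn is the response to the even one
    kind = "response" if netfn & 1 else "request"
    group = NETFN_NAMES.get(request_netfn, "netfn 0x%02x" % request_netfn)
    name = COMMAND_NAMES.get((request_netfn, cmd), "cmd 0x%02x" % cmd)
    return "%s %s: %s" % (group, kind, name)
```

Pairs that are not in the tables fall back to their hex values, so
unknown or OEM messages are still printed rather than dropped.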

>   
>> You should report this to Sun, though if everything else is working
>> correctly and it's not spewing out these errors it shouldn't affect
>> normal operations very much.
>>     
>
> We have seen cases where it spews, and others where it's
> intermittent, but when it happens it does mess up some IPMI-
> based monitoring.  And I think some cases where it's just a
> sporadic complaint with no noticeable consequences.
>
> I think at least some cases of this are due to SMM (System
> Management Mode -- special out of band CPU mode that gets
> underneath the host OS).  An SMI (SMM interrupt) comes in,
> SMM BIOS code executes and either reads the response you were
> expecting or sends a new command to which you then read the
> response.  If this is the case, BIOS authors have to fix it
> by eliminating the IPMI access; using a different channel
> or interface; or ensuring that they never issue a command
> while a response is pending and always consume the response
> after issuing a command (not sure if that last method is
> actually viable).
>   
That's an interesting theory, but I really can't imagine that's the
problem.  Surely they have a different channel for that information,
otherwise it would be very hard to account for in the driver.  Plus, if
that were happening, the SMM code would flush out the interface, start
its own message and finish it.  The driver would either see the
interface in a strange state or would time out; it wouldn't see a wrong
message.  And it wouldn't explain the off-by-one errors Andy is seeing.

I can only think of two things that could cause this:

A new message could get started while one is in progress.  I can't see
how that could happen, though; there's a queue in ipmi_si_intf.c and
start_kcs_transaction() should reject a new message if it's not in the
idle state.  If something got messed up, you should see timeouts.
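
The guard described above can be illustrated with a toy model in
Python.  This is not the driver code: ToyInterface, ToySender and
their methods are hypothetical names, and the real logic lives in C in
the kernel.  It only shows why a queue plus an idle check should keep
two messages from overlapping.

```python
from collections import deque


class ToyInterface:
    """Toy single-message interface; a model, not the real driver."""

    IDLE, BUSY = "idle", "busy"

    def __init__(self):
        self.state = self.IDLE
        self.current = None

    def start_transaction(self, msg):
        # Refuse to start a second message while one is outstanding.
        if self.state != self.IDLE:
            return False
        self.state = self.BUSY
        self.current = msg
        return True

    def finish_transaction(self):
        # Hand back the finished message and return to idle.
        msg, self.current = self.current, None
        self.state = self.IDLE
        return msg


class ToySender:
    """Queues messages and starts the next one only when idle."""

    def __init__(self):
        self.iface = ToyInterface()
        self.pending = deque()

    def send(self, msg):
        self.pending.append(msg)
        self.kick()

    def kick(self):
        if self.pending and self.iface.state == ToyInterface.IDLE:
            self.iface.start_transaction(self.pending.popleft())

    def complete(self):
        msg = self.iface.finish_transaction()
        self.kick()
        return msg
```

In this model, a second send() while a message is outstanding just
sits in the queue, and a direct start_transaction() call on a busy
interface is rejected.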

A message gets freed while in use, then gets reused.  But that doesn't
really explain the symptoms, especially the "off by one" problem Andy is
seeing.

So I'm kind of out of ideas.  Time for bed here, and I'll spend some
time thinking about it.  I have the test running now on a machine trying
to reproduce.  To tell if it's really the hardware would require
instrumenting the driver to keep a trace buffer of bytes
written/received and dump the trace buffer when the error occurs.  I
don't think it's time for that yet; it kind of looks like a driver issue.
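
For reference, the trace-buffer idea can be modeled in a few lines of
Python.  This is a user-space sketch under assumptions, not a driver
patch: ByteTrace, record() and dump() are hypothetical names, and the
"W"/"R" direction tags and the capacity are arbitrary choices.

```python
from collections import deque


class ByteTrace:
    """Fixed-size trace of bytes written to / read from the interface."""

    def __init__(self, capacity=64):
        # A deque with maxlen silently discards the oldest entry when full.
        self.events = deque(maxlen=capacity)

    def record(self, direction, byte):
        # direction is "W" for a written byte, "R" for a received byte.
        self.events.append((direction, byte & 0xFF))

    def dump(self):
        # Oldest entry first, formatted as e.g. "W:18 R:1c".
        return " ".join("%s:%02x" % (d, b) for d, b in self.events)
```

The driver would call record() on every byte and dump() only when the
unexpected-message error fires, so the cost in the normal case is one
append per byte.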

-corey
