Corey Minyard wrote:

> Bela Lubkin wrote:
> >
> > Corey (or anyone), do you have a tool which parses these logs
> > and translates the unexpected netfn messages to a human
> > readable representation like the "Get SDR" you show above?  I
> > can do it manually (or write a script), but don't want to
> > duplicate effort if it already exists.

> Well, no, not really.  Those logs are not supposed to come 
> out, so IMHO there's not much value in decoding them.  They are
> really more a "something is going wrong, here's hopefully enough
> information for you to fix it" log.

Sure, but when you set out to fix it, it's probably going to help
if you know what the various factions were trying to do.  The
user-level code calling into the driver, the SMM code, whatever is
running inside the BMC -- they and the engineers that work on them
are likely to all have different terminology and display
representations, different #defines etc.  Being able to communicate
the hex _and_ human readable interpretations helps overcome those
differences.

You should not read my question as a complaint ("why isn't this
translated?  why doesn't this log translator exist?"); I was just
asking whether anyone had already done it, to avoid duplicating effort.
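For what it's worth, the translation is mostly a table lookup on the
(netfn, cmd) byte pair.  Here's a minimal sketch of such a script; the
tables cover only a handful of common commands from the IPMI spec, and
the log-line format is assumed, not taken from the actual driver output:

```python
# Sketch of a decoder for (netfn, cmd) pairs pulled out of driver logs.
# Only a few well-known commands are tabulated here; extend as needed.
NETFN_NAMES = {
    0x00: "Chassis", 0x02: "Bridge", 0x04: "Sensor/Event",
    0x06: "App", 0x08: "Firmware", 0x0A: "Storage", 0x0C: "Transport",
}
CMD_NAMES = {
    (0x06, 0x01): "Get Device ID",
    (0x0A, 0x23): "Get SDR",
    (0x04, 0x2D): "Get Sensor Reading",
}

def decode(netfn, cmd):
    base = netfn & ~1               # response netfn is request netfn | 1
    kind = "rsp" if netfn & 1 else "req"
    name = CMD_NAMES.get((base, cmd), "cmd 0x%02X" % cmd)
    return "%s %s (%s)" % (NETFN_NAMES.get(base, "netfn 0x%02X" % base),
                           name, kind)

print(decode(0x0B, 0x23))           # Storage Get SDR (rsp)
```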

> > I think at least some cases of this are due to SMM (System
> > Management Mode -- special out of band CPU mode that gets
> > underneath the host OS).  An SMI (SMM interrupt) comes in,
> > SMM BIOS code executes and either reads the response you were
> > expecting or sends a new command to which you then read the
> > response.  If this is the case, BIOS authors have to fix it
> > by eliminating the IPMI access; using a different channel
> > or interface; or ensuring that they never issue a command
> > while a response is pending and always consume the response
> > after issuing a command (not sure if that last method is
> > actually viable).

> That's an interesting theory, but I really can't imagine that's the
> problem.  Surely they have a different channel for that information,
> otherwise it would be very hard to account for in the driver.

That's the point -- SMM code doing that would cause all sorts of
trouble, and we have reports of trouble.

I'm pretty sure we've had at least one case where the system OEM
reported (in very unclear language) that they'd changed the
method of communication between SMM and BMC to fix this problem.

The hardware clearly has to have a separate channel to talk to
SMM if it's going to avoid racing with a driver.  But even if the
hardware has it, that doesn't mean an SMM BIOS writer is going to
use it until the bug reports come in...

> Plus, if
> that was happening, it would flush out the interface, start its own
> message and finish it.  The driver would either see the interface in a
> strange state or would time out, it wouldn't see a wrong message.

Again, that's a description of what the SMM code would do if it
was written as well as possible (this time in a situation where
the hardware _didn't_ give it a private channel).  And again,
none of this code is written as well as possible the first time.

This sequence seems entirely possible to me:

  host driver -> command1  -> BMC
  [SMM interrupt]
     SMI BIOS -> command2  -> BMC
     SMI BIOS <- response1 <- BMC
     SMI BIOS: huh, that was weird...
  [return from SMI]
  host driver <- response2 <- BMC
  host driver: "BMC returned incorrect response, expected [response1]..."

It can only happen if the SMI BIOS is buggy / misdesigned.  Which
does happen in the real world, especially in prerelease machines.
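The sequence above can be modeled as a toy simulation: treat the BMC
as answering commands strictly in order, and let an SMI handler slip a
command in and consume the pending response.  This is purely
illustrative; the class and names are made up and bear no relation to
the real driver code:

```python
# Toy model of the interleaving: if SMM code injects a command and
# steals the pending response, the host driver reads the reply to the
# wrong command -- exactly the "incorrect response" symptom.
from collections import deque

class ToyBMC:
    def __init__(self):
        self.pending = deque()      # responses queued in command order
    def send(self, cmd):
        self.pending.append("response to " + cmd)
    def recv(self):
        return self.pending.popleft()

bmc = ToyBMC()
bmc.send("command1")                # host driver issues command1
bmc.send("command2")                # SMI handler issues command2...
smm_got = bmc.recv()                # ...and consumes response1 ("huh?")
host_got = bmc.recv()               # host driver now reads response2
# host_got is the response to command2, not the response1 it expected
```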

> And it wouldn't explain the off by one errors Andy is seeing.

I agree with you on that.  His particular syndrome is way too
regular to have that cause.  I was speaking in more general
terms about global causes of "BMC returned incorrect response"
situations.

> I can only think of two things that could cause this:
> 
> A new message could get started while one is in progress.  I can't see
> how that could happen though, there's a queue in ipmi_si_intf.c and
> start_kcs_transaction() should reject a new message if it's not in idle
> state.  If something got messed up, you should see timeouts.

Postulate an MP locking glitch.  This could be a bug in the
openipmi driver, but there are possibilities at other levels
as well.  Host OS locking primitive bug (e.g.: Linux
compiled for the wrong member of the x86 CPU family); cache
coherency defect in the host machine, chipset or CPU.  Bug
in another completely unrelated driver which has a higher
interrupt priority than IPMI, then screws up the interrupt
controller on its way out, allowing a 2nd IPMI interrupt
through to another CPU when it shouldn't have been.

I've seen each of those at least once in my career... (not with
respect to IPMI).

Any single instance (including Andy's and Mathieu's) _probably_ isn't
any of those, just a simple bug in openipmi or the BMC firmware.
Collectively over a large number of instances, you will see some
cases of these weirder causes.

> So I'm kind of out of ideas.  Time for bed here, and I'll spend some
> time thinking about it.  I have the test running now on a 
> machine trying
> to reproduce.  To tell if it's really the hardware would require
> instrumenting the driver to keep a trace buffer of bytes
> written/received and dump the trace buffer when the error occurs.  I
> don't think it's time for that, it kind of looks like a driver issue.

That trace buffer would be a good thing to have some day,
regardless of the outcome of these instances...
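To make the idea concrete, here's a rough sketch of such a trace
buffer: record every byte written to or read from the interface in a
fixed-depth ring, and dump the recent history when a mismatch is
detected.  In the real driver this would be C inside ipmi_si with
appropriate locking; the class and names here are invented for
illustration only:

```python
# Sketch of a byte-level trace ring buffer.  Old entries fall off the
# end automatically; dump() is what you'd print on
# "BMC returned incorrect response".
from collections import deque

class ByteTrace:
    def __init__(self, depth=64):
        self.buf = deque(maxlen=depth)
    def wr(self, byte):                 # byte written to the interface
        self.buf.append(("wr", byte))
    def rd(self, byte):                 # byte read from the interface
        self.buf.append(("rd", byte))
    def dump(self):
        return ["%s 0x%02X" % (d, b) for d, b in self.buf]

trace = ByteTrace(depth=4)
trace.wr(0x18)      # e.g. netfn App (0x06 << 2)
trace.wr(0x01)      # e.g. cmd Get Device ID
trace.rd(0x1C)      # e.g. response netfn
# on error: print("\n".join(trace.dump()))
```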

>Bela<