Re: [fm-discuss] FMA stuff for nexus/framework drivers?

Gavin Maltby Wed, 02 Jan 2008 06:45:01 -0800

Hi,

Garrett D'Amore wrote:

I want to run some initial ideas about my SDcard framework error report classes by the group. Here's what I'm thinking about so far:


Disclaimer: I know zilch about SDcard.

FMA error classes for SDA framework:

sda.timeout : Unexpected command or data timeout
sda.auto_cmd12 : Failure during AutoCMD 12 execution
sda.crc7 : CRC7 failure on CMD/DAT line
sda.protocol : Protocol or signaling error on CMD pins
sda.init : Card initialization failure
sda.ocr : Incompatible card/host OCR combination
sda.current : Current overlimit detected
sda.host : Internal host or slot failure
sda.powerup : Card failed to power up
sda.4bit : Failed to set wide (4-bit) bus mode
sda.clock : Failed to set clock speed
sda.cardtype : Card type unknown
sda.cid : Unable to get CID
sda.csd : Unable to get CSD
sda.parse : Unable to parse CID/CSD
sda.rca : Unable to get card RCA
sda.blklen : Unable to set block length
sda.props : Unable to set DDI properties (ENOMEM?)
sda.busy : Card removed while still busy
sda.version : Card version not supported
sda.child : Unable to create or online child node


On the face of it that appears like a reasonable set of ereports - descriptive
and not excessively numerous.  Are these the main list of detectable error
types as listed in the SDcard spec, or a list you have synthesized?

You also need to think about what you'll include on each ereport as additional
payload, other than the ereport class itself.  What will the detector FMRI
be, and is it already represented in our hc topology tree?  If not, we need
to work out where to extend the hc tree to represent these nodes, and to
write an enumerator to achieve this.  What additional error-specific
telemetry is available for these errors (status registers etc.), and do
you want to include this in just raw form in the ereport payload or will
you also partially "decode" it to make it more readable (at the cost of some
space)?  In deciding your set of ereports and their payloads you also need
to think ahead to how you will diagnose from these ereports; e.g., in the
simplest case you can compound all the above ereports into a single
ereport class with an additional payload member that distinguishes the
individual error type - you might choose to do that if the diagnosis
was pretty much the same regardless of error type.
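
As a rough illustration of the "single class plus distinguishing payload member" idea, with both raw and partially decoded telemetry, here is a minimal standalone sketch. Everything in it is an assumption made for illustration: the HYP_* status bits, the sda_payload_t layout and the helper names do not come from the SD specification or from any Solaris header.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error-status bit layout, for illustration only. */
#define HYP_ERR_CMD_TIMEOUT   (1u << 0)
#define HYP_ERR_CMD_CRC       (1u << 1)
#define HYP_ERR_DATA_TIMEOUT  (1u << 2)
#define HYP_ERR_CURRENT       (1u << 3)

/* Hypothetical payload carried on one shared ereport class. */
typedef struct sda_payload {
	const char *errtype;     /* distinguishes the individual error type */
	uint32_t    raw_status;  /* raw register contents */
	char        decoded[96]; /* optional readable form (costs space) */
} sda_payload_t;

/* Turn the set bits of status into a comma-separated list of names. */
static void
sda_decode_status(uint32_t status, char *buf, size_t len)
{
	static const struct { uint32_t bit; const char *name; } bits[] = {
		{ HYP_ERR_CMD_TIMEOUT,  "cmd_timeout" },
		{ HYP_ERR_CMD_CRC,      "cmd_crc" },
		{ HYP_ERR_DATA_TIMEOUT, "data_timeout" },
		{ HYP_ERR_CURRENT,      "current" },
	};
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof (bits) / sizeof (bits[0]); i++) {
		if ((status & bits[i].bit) == 0)
			continue;
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, bits[i].name, len - strlen(buf) - 1);
	}
}

/* Fill a payload with the error type plus raw and decoded telemetry. */
static void
sda_payload_fill(sda_payload_t *p, const char *errtype, uint32_t status)
{
	p->errtype = errtype;
	p->raw_status = status;
	sda_decode_status(status, p->decoded, sizeof (p->decoded));
}
```

Keeping the raw value alongside the decoded string means a later diagnosis can still consult bits that the decoder does not know about.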

What would be the fully qualified ereport class - ereport.foo.bar.sda.ocr etc?

Note that other kinds of errors, such as PCI errors or DMA failures, should
be reported to the framework by the host directly.


I think so, yes.

Here are the C #define's that the framework will use for these classes.

SDA_ETIMEOUT
SDA_EACMD12
SDA_ECRC7
SDA_EPROTOCOL
SDA_EINIT
SDA_EOCR
SDA_ECURRENT
SDA_EHOST
SDA_EPOWER
SDA_E4BIT
SDA_ECLOCK
SDA_ECARDTYPE
SDA_ECID
SDA_ECSD
SDA_ERCA
SDA_EBLKLEN
SDA_EPROPS
SDA_EBUSY
SDA_EVERSION
SDA_ECHILD
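
One way such defines could be tied to ereport class strings is sketched below. It is only a guess at the shape of a private header: the prefix "ereport.foo.bar" is the placeholder used earlier in this thread rather than a real class tree, only a few of the proposed defines are shown, and sda_ereport_class() is a hypothetical helper.

```c
#include <stdio.h>
#include <string.h>

/* Placeholder prefix taken from the question in this thread, not a real tree. */
#define SDA_EREPORT_PREFIX  "ereport.foo.bar"

/* A few of the proposed defines, mapped to the leaf classes listed above. */
#define SDA_ETIMEOUT  "sda.timeout"
#define SDA_EACMD12   "sda.auto_cmd12"
#define SDA_ECRC7     "sda.crc7"
#define SDA_EOCR      "sda.ocr"

/*
 * Build "<prefix>.<leaf>" into buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 */
static int
sda_ereport_class(char *buf, size_t len, const char *leaf)
{
	int n = snprintf(buf, len, "%s.%s", SDA_EREPORT_PREFIX, leaf);

	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

Passing SDA_EOCR with a large enough buffer would produce the string "ereport.foo.bar.sda.ocr".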


These are really going to be private to your implementation so you
can indulge in any namespace conventions you fancy.  It is common,
but not required, to have your own header file that captures
most of the content of your ereports.  See, for example,

uts/intel/sys/fm/cpu/AMD.h
uts/intel/sys/fm/cpu/GMCA.h
uts/sparc/sys/fm/cpu/UltraSPARC-T1.h
uts/sparc/sys/fm/cpu/UltraSPARC-III.h

These have defines for the ereport subclasses and leaf classes,
payload member names, bitmaps for what payload members to include
for each ereport type, etc.

Corrective action for all except the following is to treat the error as
fatal. The card and slot are offlined, and the slot will remain offline until the
card is removed and reinserted. (The kernel sda framework will remove
power from the slot when these errors are detected, to minimize
possible further damage.)


Have you considered whether you want/need this action taken synchronously,
at the time the error interrupt is handled, or asynchronously in the
userland response agent once a fault diagnosis is made from
the ereport raised when the event was detected?  If it is to be done
from a userland response agent then fmd will replay the fault at
restart etc so you can repeat the action - so you need to consider
how fmd can detect (via FMRI is_present etc operations) whether the
faulted resource is still present or has been removed/reinserted.
For say a cpu error we do this by detecting serial number change -
but in your case you might even consider a reboot to have "fixed"
the problem unless it is observed again?  Anyway, all this is about
keeping the fmd state in sync with the kernel and physical state -
e.g., if fmd has cached the fault and the card is removed and
reinserted then do we want fmd to consider the fault repaired or
the resource to be faulty but in use etc.
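
To make the "is the faulted resource still present, or has it been replaced" question concrete, here is a small standalone sketch modelled on the serial-number comparison mentioned for cpus. The slot_state_t structure, the idea of keying on a card serial number, and all names are assumptions for illustration; this is not how fmd or libtopo actually expose presence checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what the kernel currently sees in a slot. */
typedef struct slot_state {
	bool     card_present;
	uint32_t serial;        /* identity of the inserted card, if any */
} slot_state_t;

typedef enum {
	FAULT_STILL_PRESENT,     /* same card is still in the slot */
	FAULT_RESOURCE_GONE,     /* slot is empty */
	FAULT_RESOURCE_REPLACED  /* a different card is in the slot */
} fault_status_t;

/*
 * Compare the identity recorded with a cached fault against the
 * current slot contents, as a replay at fmd restart might need to do.
 */
static fault_status_t
fault_check(uint32_t faulted_serial, const slot_state_t *now)
{
	if (!now->card_present)
		return (FAULT_RESOURCE_GONE);
	if (now->serial != faulted_serial)
		return (FAULT_RESOURCE_REPLACED);
	return (FAULT_STILL_PRESENT);
}
```

Note that an identity comparison like this cannot tell "same card removed and reinserted" apart from "same card never touched", which is exactly the policy question raised above.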

SDA_E4BIT: Degraded mode -- it will be slow
SDA_ECLOCK: Degraded mode -- it will be slow
SDA_ECHILD: No action to the slot. (Maybe other children exist. The
child node of course will not be created.)
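
The corrective-action policy described above (fatal for everything except the three listed classes) could be expressed as a small lookup. This is a standalone sketch; the sda_action_t type and sda_action_for() are hypothetical names, and leaf class strings are used instead of the SDA_E* defines so the fragment stands on its own.

```c
#include <string.h>

typedef enum {
	SDA_ACT_OFFLINE_SLOT,  /* fatal: offline card and slot, remove power */
	SDA_ACT_DEGRADED,      /* keep running, but slowly */
	SDA_ACT_NONE           /* no action on the slot */
} sda_action_t;

/* Map a leaf error class to the corrective action described above. */
static sda_action_t
sda_action_for(const char *leaf)
{
	if (strcmp(leaf, "sda.4bit") == 0 || strcmp(leaf, "sda.clock") == 0)
		return (SDA_ACT_DEGRADED);
	if (strcmp(leaf, "sda.child") == 0)
		return (SDA_ACT_NONE);
	/* Everything else is treated as fatal for the slot. */
	return (SDA_ACT_OFFLINE_SLOT);
}
```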
So the one that is weird to me is SDA_EHOST. It would be nice for the host controller to signal a generic error to the SDA framework, so that the framework can offline any children, and avoid trying to send further commands to the device. The problem here is that I can only postulate about certain errors... there could be many more that I've not thought about. But as an example, the device can report an error for a DMA failure. (I'd guess this would also be reported as a PCI abort, but I'm not sure.) For a leaf node, I might imagine either using the predefined DDI_FM_DEVICE_xxx types, or creating my own for this particular device. For SDA host controllers I can imagine a few possible courses of action:
1) issue an SDA_EHOST report
2) issue an SDA_EHOST report, annotated with a human description with more detail for debug. (E.g. "desc" with string data type equal to "DMA error interrupt" or somesuch.)
3) issue both SDA_EHOST report (unannotated), and a more descriptive host-controller-specific ereport.
4) issue only a descriptive host-controller ereport to FMA
I think I prefer #2 since it makes it easiest for host controller drivers (more centralized logic is good) and since the failure recovery for all cases of SDA_EHOST is the same, but I'm not entirely sure how sane the idea of passing "opaque human readable messages" (e.g. for posting to syslog) is. I'm interested to hear thoughts.


Steve may have more understanding and ideas here.  We don't expect 100%
diagnosis coverage - aim for at least the common failures and to
at least capture full telemetry details for those rarer and
difficult to diagnose errors (those requiring human intervention).
You can raise a generic catch-all fault for those, for which the
corresponding knowledge article could provide updated troubleshooting
detail but could also request (as the console fault message could too)
that a call be placed with support to diagnose this fault.

The other weird ones are SDA_EPROPS and SDA_ECHILD. These are basically the result of allocation failures in kernel DDI routines. It means that when a card is inserted, the child node for that card might not be created on hotplug. I think we want to know why... right now I am just cmn_err'ing this, but FMA seems like the way to go for this. Is that right, or is FMA really only for catching hardware faults?


Generically speaking we would like to introduce a generic "software FMA"
diagnosis framework into the Solaris kernel and userland.  We do not
have the hooks now, don't currently have the project resourced,
and software FMA presents a number of new challenges to the
overall framework we have not already addressed from our mostly
hardware-oriented past.  Right now I think I'd discourage individual
subsystems from rolling their own here because it will just make a
later implementation more difficult and complex.  Consider this
example of a kernel memory allocation failure: if we routinely
report this (via full ereport with subsequent diagnosis actions)
at the point of observing kmem_alloc failure then the very likely
overall system behaviour is that when allocation failures occur
we'll get a severe "storm" of ereports from all affected subsystems
all telling us no more than "it failed" and an indication of failure
rate (based on ereport numbers).  That would have the diagnosis
engines chewing cpu time, ereport preparation in the kernel
possibly making the memory situation still tighter, and so on.
What would be better would be for the kmem backend to be
responsible for reporting these failures (eg vmem arena
exhausted) and for a driver to avoid raising ereports
that it knows to be explained by an upstream failure
to allocate memory, or to indicate in the ereport
payload that the error was observed under conditions of
some allocation having failed.
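
The two alternatives in the last sentence (do not raise the ereport, or raise it with a payload note that an allocation had failed) can be sketched as a tiny decision function. This is illustrative only: sda_post_t, sda_post_decision() and the backend_reports_alloc flag are hypothetical, and no such hook exists in the kmem backend today.

```c
#include <stdbool.h>

typedef enum {
	SDA_POST_PLAIN,      /* post the ereport as usual */
	SDA_POST_ANNOTATED,  /* post, but flag "allocation failed" in payload */
	SDA_POST_SUPPRESS    /* do not post; upstream failure explains it */
} sda_post_t;

/*
 * alloc_failed:          the error was observed after an allocation failure.
 * backend_reports_alloc: assumption that the memory backend reports
 *                        exhaustion itself, so the driver can stay quiet.
 */
static sda_post_t
sda_post_decision(bool alloc_failed, bool backend_reports_alloc)
{
	if (!alloc_failed)
		return (SDA_POST_PLAIN);
	return (backend_reports_alloc ? SDA_POST_SUPPRESS : SDA_POST_ANNOTATED);
}
```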

Cheers

Gavin

Thanks!

-- Garrett

cindi wrote:
Hi Garrett,

It's nice to see someone tackling another IO subsystem.
You should separate 'faults' from 'errors'. In the FMA, a fault is defined as something that is broken (and associated with a piece of hardware) or defective (and associated with a piece of code). What you have described below are errors. Errors are symptoms produced by faults. We can use the information captured at the time the error is detected to work out what is broken or defective.
It's kinda like when you go to the doctor with a bunch of symptoms that you've noticed and ask for a diagnosis. You wouldn't want the doctor to just reiterate your symptoms back to you. You want the doctor to tell you what's wrong with you. That's what we do with FMA: error information is captured by the error detectors and fed to a diagnosis engine that tells us what's broken. For example, a PCI parity error results in a diagnosis that tells us that a PCI card may be busted and needs to be replaced.
Sorry for the diatribe but it's important to make sure we're on the same page.
First thing to do is describe the different types of faulty or defective components in your subsystem. Something like:
        - controller
        - sdcard
        - firmware (?)
        - target

We call these ASRUs (or sometimes resources).
Now think about how each of the error symptoms below can be explained by one or more faults in your ASRU list. What algorithm would you use given each possible error or set of errors to diagnose the problem and answer the question: what's broken?
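
As a toy illustration of relating error symptoms to candidate ASRUs, the sketch below gives each error a set of suspects and narrows the set by intersection when several errors are seen together. The ASRU names come from the list above, but the error-to-suspect table is made up purely for illustration and is not a real diagnosis; all identifiers are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* ASRU candidates from the list above, as a bitmask. */
enum {
	ASRU_CONTROLLER = 1u << 0,
	ASRU_SDCARD     = 1u << 1,
	ASRU_FIRMWARE   = 1u << 2,
	ASRU_TARGET     = 1u << 3,
	ASRU_ALL        = 0xfu
};

typedef struct sda_suspect_map {
	const char *leaf;
	unsigned    suspects;
} sda_suspect_map_t;

/* Illustrative guesses only; a real table needs subsystem knowledge. */
static const sda_suspect_map_t sda_suspects[] = {
	{ "sda.host", ASRU_CONTROLLER },
	{ "sda.crc7", ASRU_CONTROLLER | ASRU_SDCARD },
	{ "sda.cid",  ASRU_SDCARD },
};

/* Return the suspect set for one error, or 0 if the error is unknown. */
static unsigned
sda_suspects_for(const char *leaf)
{
	size_t i;

	for (i = 0; i < sizeof (sda_suspects) / sizeof (sda_suspects[0]); i++) {
		if (strcmp(sda_suspects[i].leaf, leaf) == 0)
			return (sda_suspects[i].suspects);
	}
	return (0);
}

/* Intersect the suspect sets of all observed (known) errors. */
static unsigned
sda_narrow(const char *const *leaves, size_t n)
{
	unsigned set = ASRU_ALL;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned s = sda_suspects_for(leaves[i]);

		if (s != 0)
			set &= s;
	}
	return (set);
}
```

With this made-up table, seeing both "sda.crc7" and "sda.cid" would narrow the suspects to the card alone.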
Garrett D'Amore wrote:
First a bit of background. I've developed a framework for SDcard drivers called "sda". This supports both host drivers (e.g. "sdhost") and target drivers (e.g. "sdcard"). Actually, "sdcard" itself is a pseudo-nexus driver like scsa2usb... it allows "sd(7d)" to act as the ultimate target for these kinds of memory cards. The full details are in PSARC 2007/659 (SDcard Stack Phase I.)
So what I'm trying to figure out is how to "enable" this stuff for FMA. (Or, alternatively, get an appropriate waiver. That might not be as bad as it sounds... it's probably pretty unlikely that anyone will care too much if their SDcard goes south... just remove and reinsert in most cases.)
There are several classes of fault that I can imagine occurring:
1) errors coming from the host's parent. E.g. PCI parity errors, etc. I think I understand the docs on how to do this.
Here, I think your nexus or framework need simply call pci_ereport_post() and the generic PCI diagnosis algorithms should work out the faulty ASRU (controller).
2) errors that are specific to the host controller. E.g. an over-current error, or a CRC error interrupt on the SD data pins.
These errors sound hardware specific and you may need to define special diagnosis algorithms but perhaps there are certain classes of errors that can be diagnosed by a general-purpose algorithm.
3) errors that only the framework can tell. E.g. the card is requesting an illegal voltage change, or the card has failed to generate a "relative card address" properly after several attempts. Clearly it would be nice if the framework could participate here.
Absolutely. This is where the framework can detect and report errors (ereport events) and diagnose problems that are common for all components under its control w/o having to involve your consumers.
Typically, what happens is you develop an error reporting interface (ala pci_fm_ereport_post()) for errors detected by the framework. You can use fm_ereport_post() (uts/common/os/fm.c) or ddi_fm_ereport_post() (uts/common/os/ddifm.c) as the underlying implementation. ddi_fm_ereport_post() is evolving whereas the interfaces in fm.c are project private. Think about the ereport classes and event payload your diagnosis software will need to work out what's wrong and design the interfaces accordingly.
And just like for 2), you'll need to come up with the algorithms to do the diagnosis of these errors and work out which ASRUs (resources) are faulty.
4) errors that the target driver can tell. E.g. a target-specific error in response to a block transfer. (E.g. an attempt to write a block to a protected sector.)
I think you can punt here to the common sd FMA project.
So now, you need to think about how you want to deliver your diagnosis software. The algorithms can range from simple (map an error to a fault) to complex. Some errors you may want to feed through serd engines such that a certain number of errors have to occur before a fault diagnosis is issued. Other diagnoses may rely upon the occurrence of a particular combination of errors.
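
The serd idea (a fault diagnosis is issued only after N errors occur within a time window T) can be sketched as a small sliding-window counter. This is a standalone illustration and not the fmd serd API: the serd_t type, SERD_MAX_N and the function names are hypothetical, and timestamps are assumed to be monotonic values in whatever unit the caller chooses.

```c
#include <stdbool.h>
#include <stdint.h>

#define SERD_MAX_N  16  /* arbitrary capacity for this sketch */

typedef struct serd {
	unsigned n;                  /* events required to fire */
	uint64_t t;                  /* window length */
	uint64_t times[SERD_MAX_N];  /* ring of recent event times */
	unsigned head;               /* index of the oldest event */
	unsigned count;              /* events currently in the ring */
} serd_t;

static bool
serd_init(serd_t *s, unsigned n, uint64_t t)
{
	if (n == 0 || n > SERD_MAX_N)
		return (false);
	s->n = n;
	s->t = t;
	s->head = 0;
	s->count = 0;
	return (true);
}

/*
 * Record an event at time now (must not go backwards).
 * Returns true once n events fall within the last t time units.
 */
static bool
serd_record(serd_t *s, uint64_t now)
{
	/* Drop events that have aged out of the window. */
	while (s->count > 0 && now - s->times[s->head] > s->t) {
		s->head = (s->head + 1) % s->n;
		s->count--;
	}
	/* If the ring is still full, forget the oldest event. */
	if (s->count == s->n) {
		s->head = (s->head + 1) % s->n;
		s->count--;
	}
	s->times[(s->head + s->count) % s->n] = now;
	s->count++;
	return (s->count >= s->n);
}
```

A caller would typically reinitialise the engine after it fires, so that one burst of errors leads to a single diagnosis.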
In any case, there are two ways to code your diagnosis software. The first is to write a set of eft diagnosis rules like you see for PCI; the second is to write a C-based diagnosis fmd plugin that subscribes to your particular error reports (ereports).
If most of your diagnoses are simple 1-to-1 mappings of errors to faults, eft is probably your best bet. On the other hand, complicated algorithms can be tricky when using an eft rule set.
What I would like to do is have some help/guidance in figuring out how to architect FMA for this kind of solution. I did see PCI support, but I'm not finding any other good examples of my kind of framework with FMA support. (Notably neither USB nor 1394 frameworks have FMA support.) Can anyone offer specific advice or documentation to read? I've read the published documentation that I could find, but it seemed pretty specific to leaf-drivers, and I'm not sure how to get something like cases #2 and #3 handled properly.
This should be as clear as mud by now. Instructions on how to develop a diagnosis plugin are described in the fmd PRM (see http://opensolaris.org/os/community/fm). For samples in developing ereport generation interfaces for your framework, search the OpenSolaris code for calls to fm_ereport_post(). The final thing you'll need to do is write a libtopo enumerator to tack on the SD topology (list of ASRU and resource instances controlled by the sdcard framework). The latest PRM describes libtopo and how to write an enumerator. There are also plenty of examples in the source (lib/fm/topo/modules).
As far as your list of deliverables go, they will look something like:

        - specification of ereport events for sdcard framework for 3)
        - optional specification for controllers for 2)
        - ereport generation routine for sdcard framework for 3)
        - optional ereport generation routine for controller drivers for 2)
        - diagnosis plugin or eft rules for 3) and optionally 2)
        - libtopo enumerator for the sdcard topology

Cindi

_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org