Hi Garrett,

It's nice to see someone tackling another IO subsystem.

You should separate 'faults' from 'errors'.  In FMA, a fault is 
defined as something that is broken (and associated with a piece of 
hardware) or defective (and associated with a piece of code).  What 
you have described below are errors.  Errors are the symptoms produced 
by faults.  We can use the information captured at the time the error 
is detected to work out what is broken or defective.

It's kinda like when you go to the doctor with a bunch of symptoms that 
you've noticed and ask for a diagnosis.  You wouldn't want the doctor to 
just reiterate your symptoms back to you.  You want the doctor to tell 
you what's wrong with you.  That's what we do with FMA: error 
information is captured by the error detectors and fed to a diagnosis 
engine that tells us what's broken.  For example, a PCI parity error 
results in a diagnosis that tells us that a PCI card may be busted and 
needs to be replaced.

Sorry for the diatribe but it's important to make sure we're on the same 
page.

The first thing to do is list the types of components in your subsystem 
that can be faulty or defective.  Something like:

        - controller
        - sdcard
        - firmware (?)
        - target

We call these ASRUs (or sometimes resources).

Now think about how each of the error symptoms below can be explained by 
one or more faults in your ASRU list.  What algorithm would you use, 
given each possible error or set of errors, to diagnose the problem and 
answer the question: what's broken?
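
To make that exercise concrete, here is a rough sketch in C of the 
simplest possible algorithm: a table from error symptom to suspect 
ASRUs, with the suspect sets intersected when several errors show up 
together.  Everything in it is an assumption for illustration only: the 
error names, the sda_ prefixes, and especially the suspect assignments 
are made up and are not taken from any existing driver or from FMA.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ASRU kinds, one bit each, from the list above. */
enum sda_asru {
	SDA_ASRU_CONTROLLER = 1u << 0,
	SDA_ASRU_SDCARD     = 1u << 1,
	SDA_ASRU_FIRMWARE   = 1u << 2,
	SDA_ASRU_TARGET     = 1u << 3
};

struct sda_rule {
	const char *err;	/* made-up error symptom name */
	unsigned suspects;	/* ASRUs that could explain it */
};

/* Illustrative guesses only; the real mapping is the design work. */
static const struct sda_rule sda_rules[] = {
	{ "pci-parity",      SDA_ASRU_CONTROLLER },
	{ "over-current",    SDA_ASRU_CONTROLLER | SDA_ASRU_SDCARD },
	{ "data-crc",        SDA_ASRU_CONTROLLER | SDA_ASRU_SDCARD },
	{ "illegal-voltage", SDA_ASRU_SDCARD },
	{ "rca-failed",      SDA_ASRU_SDCARD }
};

/* Return the suspect mask for one error, or 0 if it is unknown. */
unsigned
sda_suspects(const char *err)
{
	size_t i;

	for (i = 0; i < sizeof (sda_rules) / sizeof (sda_rules[0]); i++) {
		if (strcmp(sda_rules[i].err, err) == 0)
			return (sda_rules[i].suspects);
	}
	return (0);
}

/* Intersect the suspect sets of all known errors in errs[0..n-1]. */
unsigned
sda_diagnose(const char *const errs[], size_t n)
{
	unsigned suspects = 0;
	int seen = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned s = sda_suspects(errs[i]);

		if (s == 0)
			continue;	/* unknown error: no information */
		suspects = seen ? (suspects & s) : s;
		seen = 1;
	}
	return (suspects);
}
```

With this table, an over-current error alone leaves both the controller 
and the card as suspects, while over-current together with rca-failed 
narrows it to the card.  A real diagnosis would live in eft rules or an 
fmd plugin rather than in code like this.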

Garrett D'Amore wrote:
> First a bit of background.  I've developed a framework for SDcard 
> drivers called "sda".  This supports both host drivers (e.g. "sdhost") 
> and target drivers (e.g. "sdcard").   Actually, "sdcard" itself is a 
> pseudo-nexus driver like scsa2usb... it allows "sd(7d)" to act as the 
> ultimate target for these kinds of memory cards.  The full details are 
> in PSARC 2007/659 (SDcard Stack Phase I.)
> 
> So what I'm trying to figure out is how to "enable" this stuff for FMA.  
> (Or, alternatively, get an appropriate waiver.  That might not be as bad 
> as it sounds... it's probably pretty unlikely that anyone will care 
> too much if their SDcard goes south... just remove and reinsert in most 
> cases.)
> 
> There are several classes of fault that I can imagine occurring:
> 
> 1) errors coming from the host's parent.  E.g. PCI parity errors, etc.  
> I think I understand the docs on how to do this.

Here, I think your nexus or framework need simply call 
pci_ereport_post() and the generic PCI diagnosis algorithms should work 
out the faulty ASRU (controller).

> 
> 2) errors that are specific to the host controller.  E.g. an 
> over-current error, or a CRC error interrupt on the SD data pins.

These errors sound hardware specific and you may need to define special 
diagnosis algorithms but perhaps there are certain classes of errors 
that can be diagnosed by a general-purpose algorithm.

> 
> 3) errors that only the framework can tell.  E.g. the card is requesting 
> an illegal voltage change, or the card has failed to generate a 
> "relative card address" properly after several attempts.    Clearly it 
> would be nice if the framework could participate here.

Absolutely.  This is where the framework can detect and report errors 
(ereport events) and diagnose problems that are common for all 
components under its control w/o having to involve your consumers.

Typically, what happens is you develop an error reporting interface (a 
la pci_ereport_post()) for errors detected by the framework.  You can 
use fm_ereport_post() (uts/common/os/fm.c) or ddi_fm_ereport_post() 
(uts/common/os/ddifm.c) as the underlying implementation. 
ddi_fm_ereport_post() is an Evolving interface, whereas the interfaces 
in fm.c are Project Private.  Think about the ereport classes and event 
payload your diagnosis software will need to work out what's wrong, and 
design the interfaces accordingly.
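
As a starting point for that design, here is a minimal stand-alone 
sketch in C of what a framework-level ereport description might carry.  
It is hypothetical throughout: the struct, the sda_ereport_init() name, 
the "ereport.io.sda." class prefix and the payload members are all 
assumptions made up for illustration, and it does not call the real 
kernel interfaces.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical description of one framework-detected SD error. */
struct sda_ereport {
	char eclass[64];	/* made-up class, e.g. ...sda.card.rca-failed */
	uint64_t ena;		/* Error Numeric Association from the caller */
	int slot;		/* slot on the host controller */
	uint32_t detail;	/* error-specific value, e.g. voltage mask */
};

/*
 * Fill in *er for the given subclass.  Returns 0 on success, or -1 if
 * the resulting class name does not fit in er->eclass.
 */
int
sda_ereport_init(struct sda_ereport *er, const char *subclass,
    uint64_t ena, int slot, uint32_t detail)
{
	int len;

	len = snprintf(er->eclass, sizeof (er->eclass),
	    "ereport.io.sda.%s", subclass);
	if (len < 0 || (size_t)len >= sizeof (er->eclass))
		return (-1);

	er->ena = ena;
	er->slot = slot;
	er->detail = detail;
	return (0);
}
```

In a real framework the wrapper would hand the class and payload members 
to ddi_fm_ereport_post() or fm_ereport_post() instead of filling in a 
struct; their argument lists are deliberately not reproduced here.  The 
point is only that each payload member should be something the diagnosis 
needs in order to pick out the faulty ASRU.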

And just like for 2), you'll need to come up with the algorithms to 
diagnose these errors and determine which ASRUs (resources) are faulty.

> 
> 4) errors that the target driver can tell.  E.g. a target-specific error 
> in response to a block transfer.  (E.g. an attempt to write a block to a 
> protected sector.)

I think you can punt here to the common sd FMA project.

So now, you need to think about how you want to deliver your diagnosis 
software.  The algorithms can range from simple (map an error to a 
fault) to complex.  Some errors you may want to feed through SERD 
engines, so that a certain number of errors have to occur within a 
given time before a fault diagnosis is issued.  Other diagnoses may 
rely upon the occurrence of a particular combination of errors.
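
To illustrate the SERD idea (N errors within time T before a diagnosis 
is made), here is a small stand-alone sketch in C.  It is not the fmd 
SERD interface; the struct, the function names and the SERD_MAX_N limit 
are invented for this example, and it assumes timestamps never go 
backwards.

```c
#include <stdbool.h>
#include <stdint.h>

#define	SERD_MAX_N	16	/* arbitrary cap for this sketch */

struct serd {
	unsigned n;			/* events needed to fire */
	uint64_t t;			/* window, same units as timestamps */
	uint64_t stamps[SERD_MAX_N];	/* ring of the last n timestamps */
	unsigned count;			/* timestamps stored so far (<= n) */
	unsigned head;			/* next ring slot to overwrite */
};

/* Set up an engine that fires on n events within t time units. */
bool
serd_init(struct serd *s, unsigned n, uint64_t t)
{
	if (n == 0 || n > SERD_MAX_N)
		return (false);
	s->n = n;
	s->t = t;
	s->count = 0;
	s->head = 0;
	return (true);
}

/*
 * Record one error at time 'now'.  Returns true when the last n
 * recorded errors all fall within a window of length t.
 */
bool
serd_record(struct serd *s, uint64_t now)
{
	uint64_t oldest;

	s->stamps[s->head] = now;
	s->head = (s->head + 1) % s->n;
	if (s->count < s->n)
		s->count++;
	if (s->count < s->n)
		return (false);

	/* Ring is full, so head now indexes the oldest of the last n. */
	oldest = s->stamps[s->head];
	return (now - oldest <= s->t);
}
```

With n = 3 and t = 10, errors at times 0, 5 and 8 fire the engine on the 
third call, whereas errors at 0, 20 and 40 do not.  This sketch keeps 
firing for as long as the condition holds; whether to reset the engine 
after a diagnosis is a design choice left open here.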

In any case, there are two ways to code your diagnosis software.  The 
first is to write a set of eft diagnosis rules like you see for PCI.  
The second is to write a C-based fmd diagnosis plugin that subscribes 
to your particular error reports (ereports).

If most of your diagnoses are simple 1-to-1 mappings of errors to 
faults, eft is probably your best bet.  On the other hand, complicated 
algorithms can be tricky to express in an eft rule set.

> 
> What I would like to do is have some help/guidance in figuring out how 
> to architect FMA for this kind of solution.  I did see PCI support, but 
> I'm not finding any other good examples of my kind of framework with FMA 
> support.  (Notably neither USB nor 1394 frameworks have FMA support.)  
> Can anyone offer specific advice or documentation to read?  I've read 
> the published documentation that I could find, but it seemed pretty 
> specific to leaf-drivers, and I'm not sure how to get something like 
> cases #2 and #3 handled properly.

This should be as clear as mud by now.  Instructions on how to develop a 
diagnosis plugin are in the fmd PRM (see 
http://opensolaris.org/os/community/fm).  For samples of ereport 
generation interfaces for your framework, search the OpenSolaris code 
for calls to fm_ereport_post().  The final thing you'll need to do is 
write a libtopo enumerator to tack on the SD topology (the list of ASRU 
and resource instances controlled by the sdcard framework).  The latest 
PRM describes libtopo and how to write an enumerator.  There are also 
plenty of examples in the source (lib/fm/topo/modules).

As far as your list of deliverables goes, it will look something like:

        - specification of ereport events for sdcard framework for 3)
        - optional specification of ereport events for controllers for 2)
        - ereport generation routine for sdcard framework for 3)
        - optional ereport generation routine for controller drivers for 2)
        - diagnosis plugin or eft rules for 3) and optionally 2)
        - libtopo enumerator for the sdcard topology

Cindi
