Hi, Garrett D'Amore wrote:
I want to run some initial ideas about my SDcard framework error report classes by the group. Here's what I'm thinking about so far:
Disclaimer: I know zilch about SDcard.
FMA error classes for SDA framework:

sda.timeout    : Unexpected command or data timeout
sda.auto_cmd12 : Failure during AutoCMD 12 execution
sda.crc7       : CRC7 failure on CMD/DAT line
sda.protocol   : Protocol or signaling error on CMD pins
sda.init       : Card initialization failure
sda.ocr        : Incompatible card/host OCR combination
sda.current    : Current overlimit detected
sda.host       : Internal host or slot failure
sda.powerup    : Card failed to power up
sda.4bit       : Failed to set wide (4-bit) bus mode
sda.clock      : Failed to set clock speed
sda.cardtype   : Card type unknown
sda.cid        : Unable to get CID
sda.csd        : Unable to get CSD
sda.parse      : Unable to parse CID/CSD
sda.rca        : Unable to get card RCA
sda.blklen     : Unable to set block length
sda.props      : Unable to set DDI properties (ENOMEM?)
sda.busy       : Card removed while still busy
sda.version    : Card version not supported
sda.child      : Unable to create or online child node
On the face of it that appears like a reasonable set of ereports - descriptive and not excessively numerous. Are these the main list of detectable error types as listed in the SDcard spec, or a list you have synthesized?

You also need to think about what you'll include on each ereport as additional payload, other than the ereport class itself. What will the detector FMRI be, and is it already represented in our hc topology tree? If not, we need to work out where to extend the hc tree to represent these nodes, and to write an enumerator to achieve this. What additional error-specific telemetry is available for these errors (status registers etc.), and do you want to include this in just raw form in the ereport payload or will you also partially "decode" it to make it more readable (at the cost of some space)?

In deciding your set of ereports and their payloads you also need to think ahead to how you will diagnose from these ereports; e.g., in the simplest case you can compound all the above ereports into a single ereport class with an additional payload member that distinguishes the individual error type - you might choose to do that if the diagnosis was pretty much the same regardless of error type. What would be the fully qualified ereport class - ereport.foo.bar.sda.ocr etc.?

Note that other kinds of errors, such as PCI errors or DMA failures, should be reported to the framework by the host directly.
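To make the class-naming question concrete, here is a minimal sketch of composing a fully qualified class from a prefix and a leaf name. The prefix "ereport.foo.bar.sda" is the placeholder used above, not a decided name, and sda_ereport_class() is a hypothetical helper rather than an existing FMA interface.

```c
#include <stdio.h>
#include <string.h>

/* Placeholder prefix only; the real class hierarchy is still undecided. */
#define	SDA_EREPORT_PREFIX	"ereport.foo.bar.sda"

/*
 * Hypothetical helper: write "<prefix>.<leaf>" into buf.
 * Returns 0 on success, -1 if the result would not fit in len bytes.
 */
static int
sda_ereport_class(char *buf, size_t len, const char *leaf)
{
	int n = snprintf(buf, len, "%s.%s", SDA_EREPORT_PREFIX, leaf);

	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

If the single-compound-class design were chosen instead, the leaf would move out of the class string and into a payload member, and this helper would not be needed.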
I think so, yes.
Here are the C #define's that the framework will use for these classes.

SDA_ETIMEOUT
SDA_EACMD12
SDA_ECRC7
SDA_EPROTOCOL
SDA_EINIT
SDA_EOCR
SDA_ECURRENT
SDA_EHOST
SDA_EPOWER
SDA_E4BIT
SDA_ECLOCK
SDA_ECARDTYPE
SDA_ECID
SDA_ECSD
SDA_ERCA
SDA_EBLKLEN
SDA_EPROPS
SDA_EBUSY
SDA_EVERSION
SDA_ECHILD
These are really going to be private to your implementation so you can indulge in any namespace conventions you fancy. It is common, but not required, to have your own header file that captures most of the content of your ereports. See, for example,

uts/intel/sys/fm/cpu/AMD.h
uts/intel/sys/fm/cpu/GMCA.h
uts/sparc/sys/fm/cpu/UltraSPARC-T1.h
uts/sparc/sys/fm/cpu/UltraSPARC-III.h

These have defines for the ereport subclasses and leaf classes, payload member names, bitmaps for what payload members to include for each ereport type, etc.
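As a rough idea of what such a private header might hold for sda, here is a small sketch: leaf class names paired with a bitmap of payload members. The leaf names come from the list earlier in the thread; the SDA_PL_* bits, the table contents, and sda_payload_lookup() are all invented for illustration and are not taken from the existing cpu headers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical payload-member bits; none of these are defined by the thread. */
#define	SDA_PL_CMD	(1u << 0)	/* command index in flight */
#define	SDA_PL_STATUS	(1u << 1)	/* raw host status register */
#define	SDA_PL_DESC	(1u << 2)	/* free-form description string */

typedef struct sda_ereport_def {
	const char	*leaf;		/* leaf class name, e.g. "timeout" */
	uint32_t	payload;	/* SDA_PL_* members to include */
} sda_ereport_def_t;

/* Illustrative subset only; the payload choices are assumptions. */
static const sda_ereport_def_t sda_ereport_defs[] = {
	{ "timeout",	SDA_PL_CMD | SDA_PL_STATUS },
	{ "crc7",	SDA_PL_CMD | SDA_PL_STATUS },
	{ "host",	SDA_PL_STATUS | SDA_PL_DESC },
	{ "cardtype",	0 },
};

/* Look up the payload bitmap for a leaf; 0 on success, -1 if unknown. */
static int
sda_payload_lookup(const char *leaf, uint32_t *maskp)
{
	size_t i;
	size_t n = sizeof (sda_ereport_defs) / sizeof (sda_ereport_defs[0]);

	for (i = 0; i < n; i++) {
		if (strcmp(sda_ereport_defs[i].leaf, leaf) == 0) {
			*maskp = sda_ereport_defs[i].payload;
			return (0);
		}
	}
	return (-1);
}
```

Keeping the bitmap next to the class name means the posting routine can be table driven instead of carrying one code path per error.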
Corrective action for all except the following is fatal error. The card and slot are offlined, and the slot will remain offline until the card is removed and reinserted. (The kernel sda framework will remove power from the slot when these errors are detected, to minimize possible further damage.)
Have you considered whether you want/need this action taken synchronously, at the time the error interrupt is handled, or asynchronously in the userland response agent once a fault diagnosis is made from the ereport raised when the event was detected? If it is to be done from a userland response agent then fmd will replay the fault at restart etc so you can repeat the action - so you need to consider how fmd can detect (via FMRI is_present etc operations) whether the faulted resource is still present or has been removed/reinserted. For say a cpu error we do this by detecting serial number change - but in your case you might even consider a reboot to have "fixed" the problem unless it is observed again? Anyway, all this is about keeping the fmd state in sync with the kernel and physical state - e.g., if fmd has cached the fault and the card is removed and reinserted then do we want fmd to consider the fault repaired or the resource to be faulty but in use etc.
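The presence question can be pictured with a small sketch along the lines of the cpu serial-number check. Everything here is hypothetical (the types, the names, and the idea of using a card serial number as read from the CID); it is not the fmd FMRI interface, just the comparison such a check would boil down to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identity of whatever occupies a slot. */
typedef struct sda_card_id {
	bool		present;	/* is any card in the slot? */
	uint32_t	serial;		/* card serial number, e.g. from the CID */
} sda_card_id_t;

typedef enum {
	SDA_RES_SAME,		/* same card still there: fault still applies */
	SDA_RES_GONE,		/* slot empty: faulted resource was removed */
	SDA_RES_REPLACED	/* different card: old fault no longer applies */
} sda_presence_t;

/* Compare the identity cached with the fault against the slot's state now. */
static sda_presence_t
sda_check_presence(const sda_card_id_t *faulted, const sda_card_id_t *now)
{
	if (!now->present)
		return (SDA_RES_GONE);
	if (now->serial != faulted->serial)
		return (SDA_RES_REPLACED);
	return (SDA_RES_SAME);
}
```

Note that removing and reinserting the same card is indistinguishable from the card never having left, which is exactly the case where a policy decision is needed.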
SDA_E4BIT: Degraded mode -- it will be slow
SDA_ECLOCK: Degraded mode -- it will be slow
SDA_ECHILD: No action to the slot. (Maybe other children exist. The child node of course will not be created.)

So the one that is weird to me is SDA_EHOST. It would be nice for the host controller to signal a generic error to the SDA framework, so that the framework can offline any children, and avoid trying to send further commands to the device. The problem here is that I can only postulate about certain errors... there could be many more that I've not thought about. But as an example, the device can report an error for a DMA failure. (I'd guess this would also be reported as a PCI abort, but I'm not sure.) For a leaf node, I might imagine either using the predefined DDI_FM_DEVICE_xxx types, or creating my own for this particular device.

For SDA host controllers I can imagine a few possible courses of action:

1) issue an SDA_EHOST report
2) issue an SDA_EHOST report, annotated with a human description with more detail for debug. (E.g. "desc" with string data type equal to "DMA error interrupt" or somesuch.)
3) issue both an SDA_EHOST report (unannotated) and a more descriptive host-controller-specific ereport.
4) issue only a descriptive host-controller ereport to FMA

I think I prefer #2 since it makes it easiest for host controller drivers (more centralized logic is good) and since the failure recovery for all cases of SDA_EHOST is the same, but I'm not entirely sure how sane the idea of passing "opaque human readable messages" (e.g. for posting to syslog) is. I'm interested to hear thoughts.

Steve may have more understanding and ideas here. We don't expect 100% diagnosis coverage - aim for at least the common failures and to at least capture full telemetry details for those rarer and difficult to diagnose errors (those requiring human intervention). You can raise a generic catch-all fault for those, for which the corresponding knowledge article could provide updated troubleshooting detail but could also request (as the console fault message could too) that a call be placed with support to diagnose this fault.
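The corrective actions described above (fatal by default, degraded for the two bus-setup failures, no slot action for a failed child) could be captured in one place roughly like this. The thread only names the SDA_E* symbols as #defines, so the enum form, the numeric values, and sda_action_for() are assumptions made for this sketch.

```c
/* Error codes named in the thread; the enum form and values are assumed. */
typedef enum {
	SDA_ETIMEOUT, SDA_EACMD12, SDA_ECRC7, SDA_EPROTOCOL, SDA_EINIT,
	SDA_EOCR, SDA_ECURRENT, SDA_EHOST, SDA_EPOWER, SDA_E4BIT,
	SDA_ECLOCK, SDA_ECARDTYPE, SDA_ECID, SDA_ECSD, SDA_ERCA,
	SDA_EBLKLEN, SDA_EPROPS, SDA_EBUSY, SDA_EVERSION, SDA_ECHILD
} sda_err_t;

/* Hypothetical names for the three dispositions. */
typedef enum {
	SDA_ACT_NONE,		/* leave the slot alone */
	SDA_ACT_DEGRADED,	/* keep running, but slower */
	SDA_ACT_FATAL		/* offline card and slot, remove power */
} sda_action_t;

/* Map an error to the slot-level action described in the thread. */
static sda_action_t
sda_action_for(sda_err_t err)
{
	switch (err) {
	case SDA_E4BIT:
	case SDA_ECLOCK:
		return (SDA_ACT_DEGRADED);
	case SDA_ECHILD:
		return (SDA_ACT_NONE);
	default:
		return (SDA_ACT_FATAL);
	}
}
```

Whether this mapping lives in the kernel framework or in a userland response agent is the synchronous-versus-asynchronous question raised earlier; the table itself is the same either way.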
The other weird ones are SDA_EPROPS and SDA_ECHILD. These are basically the result of allocation failures in kernel DDI routines. It means that when a card is inserted, the child node for that card might not be created on hotplug. I think we want to know why... right now I am just cmn_err'ing this, but FMA seems like the way to go for this. Is that right, or is FMA really only for catching hardware faults?
Generically speaking we would like to introduce a generic "software FMA" diagnosis framework into the Solaris kernel and userland. We do not have the hooks now, don't currently have the project resourced, and software FMA presents a number of new challenges to the overall framework that we have not already addressed from our mostly hardware-oriented past. Right now I think I'd discourage individual subsystems from rolling their own here because it will just make a later implementation more difficult and complex.

Consider this example of a kernel memory allocation failure: if we routinely report this (via full ereport with subsequent diagnosis actions) at the point of observing kmem_alloc failure, then the very likely overall system behaviour is that when allocation failures occur we'll get a severe "storm" of ereports from all affected subsystems, all telling us no more than "it failed" and an indication of failure rate (based on ereport numbers). That would have the diagnosis engines chewing cpu time, ereport preparation in the kernel possibly making the memory situation still tighter, and so on. What would be better would be for the kmem backend to be responsible for reporting these failures (e.g. vmem arena exhausted) and for a driver to avoid raising ereports that it knows to be explained by an upstream failure to allocate memory, or to indicate in the ereport payload that the error was observed under conditions of some allocation having failed.

Cheers

Gavin
Thanks!

-- Garrett

cindi wrote:

Hi Garrett,

It's nice to see someone tackling another IO subsystem.

You should separate 'faults' from 'errors'. In FMA, a fault is defined as something that is broken (and associated with a piece of hardware) or defective (and associated with a piece of code). What you have described below are errors. Errors are symptoms produced by faults. We can use the information captured at the time the error is detected to work out what is broken or defective.

It's kinda like when you go to the doctor with a bunch of symptoms that you've noticed and ask for a diagnosis. You wouldn't want the doctor to just reiterate your symptoms back to you. You want the doctor to tell you what's wrong with you. That's what we do with FMA: error information is captured by the error detectors and fed to a diagnosis engine, which tells us what's broken. For example, a PCI parity error results in a diagnosis that tells us that a PCI card may be busted and needs to be replaced.

Sorry for the diatribe but it's important to make sure we're on the same page.

First thing to do is describe the different types of faulty or defective components in your subsystem. Something like:

- controller
- sdcard
- firmware (?)
- target

We call these ASRUs (or sometimes resources).

Now think about how each of the error symptoms below can be explained by one or more faults in your ASRU list. What algorithm would you use given each possible error or set of errors to diagnose the problem and answer the question: what's broken?

Garrett D'Amore wrote:

First a bit of background. I've developed a framework for SDcard drivers called "sda". This supports both host drivers (e.g. "sdhost") and target drivers (e.g. "sdcard"). Actually, "sdcard" itself is a pseudo-nexus driver like scsa2usb... it allows "sd(7d)" to act as the ultimate target for these kinds of memory cards.
The full details are in PSARC 2007/659 (SDcard Stack Phase I.)

So what I'm trying to figure out is how to "enable" this stuff for FMA. (Or, alternatively, get an appropriate waiver. That might not be as bad as it sounds... it's probably pretty unlikely that anyone will care too much if their SDcard goes south... just remove and reinsert in most cases.)

There are several classes of fault that I can imagine occurring:

1) errors coming from the host's parent. E.g. PCI parity errors, etc. I think I understand the docs on how to do this.

Here, I think your nexus or framework need simply call pci_ereport_post() and the generic PCI diagnosis algorithms should work out the faulty ASRU (controller).

2) errors that are specific to the host controller. E.g. an over-current error, or a CRC error interrupt on the SD data pins.

These errors sound hardware specific and you may need to define special diagnosis algorithms, but perhaps there are certain classes of errors that can be diagnosed by a general-purpose algorithm.

3) errors that only the framework can tell. E.g. the card is requesting an illegal voltage change, or the card has failed to generate a "relative card address" properly after several attempts. Clearly it would be nice if the framework could participate here.

Absolutely. This is where the framework can detect and report errors (ereport events) and diagnose problems that are common for all components under its control w/o having to involve your consumers.

Typically, what happens is you develop an error reporting interface (ala pci_fm_ereport_post()) for errors detected by the framework. You can use fm_ereport_post() (uts/common/os/fm.c) or ddi_fm_ereport_post() (uts/common/os/ddifm.c) as the underlying implementation. ddi_fm_ereport_post() is evolving whereas the interfaces in fm.c are project private.
Think about the ereport classes and event payload your diagnosis software will need to work out what's wrong and design the interfaces accordingly.

And just like for 2), you'll need to come up with the algorithms to do the diagnosis of these errors and determine which ASRUs (resources) are faulty.

4) errors that the target driver can tell. E.g. a target-specific error in response to a block transfer. (E.g. an attempt to write a block to a protected sector.)

I think you can punt here to the common sd FMA project.

So now, you need to think about how you want to deliver your diagnosis software. The algorithms can range from simple (map an error to a fault) to complex. Some errors you may want to feed through serd engines such that a certain number of errors have to occur before a fault diagnosis is issued. Other diagnoses may rely upon the occurrence of a particular combination of errors.

In any case, there are two ways to code your diagnosis software. The first is by writing a set of eft diagnosis rules like you see for PCI; the second is by writing a C-based diagnosis fmd plugin that subscribes to your particular error reports (ereports).

If most of your diagnoses are simple 1-to-1 mappings of errors to faults, eft is probably your best bet. On the other hand, complicated algorithms can be tricky when using an eft rule set.

What I would like to do is have some help/guidance in figuring out how to architect FMA for this kind of solution. I did see PCI support, but I'm not finding any other good examples of my kind of framework with FMA support. (Notably neither USB nor 1394 frameworks have FMA support.) Can anyone offer specific advice or documentation to read? I've read the published documentation that I could find, but it seemed pretty specific to leaf-drivers, and I'm not sure how to get something like cases #2 and #3 handled properly.

This should be as clear as mud by now. Instructions on how to develop a diagnosis plugin are described in the fmd PRM (see http://opensolaris.org/os/community/fm).
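For readers unfamiliar with the serd idea mentioned above (N errors within a time window T before a fault is diagnosed), here is a minimal standalone sketch of that counting rule. It is not the fmd implementation and does not use the fmd API; serd_t, serd_init(), serd_record() and the fixed SERD_MAX_N limit are all made up for illustration.

```c
#include <stdint.h>

#define	SERD_MAX_N	16	/* arbitrary cap for this sketch */

typedef struct serd {
	uint32_t	n;		/* events needed to fire */
	uint64_t	t;		/* window length, same unit as timestamps */
	uint64_t	ts[SERD_MAX_N];	/* ring of the most recent n timestamps */
	uint32_t	count;		/* timestamps currently held, at most n */
	uint32_t	head;		/* slot the next timestamp goes into */
} serd_t;

/* Set up an engine that fires on n events within t. Returns 0, or -1 if n is unusable. */
static int
serd_init(serd_t *s, uint32_t n, uint64_t t)
{
	if (n == 0 || n > SERD_MAX_N)
		return (-1);
	s->n = n;
	s->t = t;
	s->count = 0;
	s->head = 0;
	return (0);
}

/*
 * Record one event at time now (timestamps must not go backwards).
 * Returns 1 if the last n events all fall within t of each other, else 0.
 */
static int
serd_record(serd_t *s, uint64_t now)
{
	s->ts[s->head] = now;
	s->head = (s->head + 1) % s->n;
	if (s->count < s->n)
		s->count++;
	if (s->count < s->n)
		return (0);
	/* Once the ring is full, head indexes the oldest retained event. */
	return (now - s->ts[s->head] <= s->t);
}
```

With n = 3 and t = 10, events at times 0, 5 and 20 do not fire because the first and third are too far apart, while events at 20, 22 and 25 do.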
For samples in developing ereport generation interfaces for your framework, search the OpenSolaris code for calls to fm_ereport_post().

The final thing you'll need to do is write a libtopo enumerator to tack on the SD topology (list of ASRU and resource instances controlled by the sdcard framework). The latest PRM describes libtopo and how to write an enumerator. There are also plenty of examples in the source (lib/fm/topo/modules).

As far as your list of deliverables goes, it will look something like:

- specification of ereport events for sdcard framework for 3)
- optional specification for controllers for 2)
- ereport generation routine for sdcard framework for 3)
- optional ereport generation routine for controller drivers for 2)
- diagnosis plugin or eft rules for 3) and optionally 2)
- libtopo enumerator for the sdcard topology

Cindi

_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org