Re: [fm-discuss] HELP! FMA stuff for nexus/framework drivers?

Garrett D'Amore Wed, 20 Feb 2008 13:30:28 -0800

Hmm... going back through the log of messages, I unfortunately have to 
confess that I don't feel particularly closer to understanding exactly 
what it is I need to do.  The level of complexity associated with 
getting from the point where I've identified a symptom/problem in the 
drivers, to getting to an FMA event, seems quite contorted.  And again, 
the lack of readily available examples such as for USB or firewire, 
complicates it even further.  (And chapter 13 of WDD is totally oriented 
towards leaf drivers, and offers very little help for nexus drivers or 
frameworks.)  Furthermore, the somewhat conflicting (or apparently so) 
information I've received from multiple FMA team members isn't helping 
me to wrap my head around this problem any more easily.


I gather I need to write a libtopo plugin, but there are limitations and 
problems associated, since as I've already indicated, the most likely 
problems to occur (by far) will occur at device attach time -- before a 
node has been created in the device tree!  So it still isn't clear to me 
how to solve this -- and the messages I've received from the FMA folks 
seem to indicate that it is still an "unsolved" problem.

I also gather that I have to figure out how to enhance my framework so 
that faults that are reported result in ereports being generated.
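
As a rough illustration of that step, here is a minimal userland sketch
(not kernel code) of a framework fault path that fences the slot
immediately and queues an ereport-like record for later diagnosis.
Every name in it is hypothetical: the struct layout, the fw_* functions,
the "ereport.io.sdcard." class prefix, and the numeric values of the
SDA_* codes are assumptions made for the example.  In a real Solaris
driver the queueing step would presumably go through
ddi_fm_ereport_post(9F) rather than a private array.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userland model; none of these names are Solaris APIs. */
#define NSLOTS       4
#define MAX_EREPORTS 8

/* SDA_EFAULTED is named in this thread; the numeric values are invented. */
enum { SDA_OK = 0, SDA_EFAULTED = 1 };

enum slot_state { SLOT_OK = 0, SLOT_FAULTED };

struct ereport {
    char cls[64];  /* event class string (prefix below is made up) */
    int  slot;     /* slot the report points at */
};

struct framework {
    enum slot_state state[NSLOTS];
    struct ereport  queue[MAX_EREPORTS];
    int             nqueued;
};

/*
 * Called when the framework detects a fault on a slot.  The slot is
 * marked faulted right away (the immediate action); the ereport is only
 * queued, leaving diagnosis to whoever consumes the queue later.
 * Returns 0 on success, -1 on a bad slot or a full queue.
 */
int fw_report_fault(struct framework *fw, int slot, const char *what)
{
    if (slot < 0 || slot >= NSLOTS)
        return -1;

    fw->state[slot] = SLOT_FAULTED;

    if (fw->nqueued >= MAX_EREPORTS)
        return -1;

    struct ereport *e = &fw->queue[fw->nqueued++];
    snprintf(e->cls, sizeof (e->cls), "ereport.io.sdcard.%s", what);
    e->slot = slot;
    return 0;
}

/* Command entry point: a faulted slot refuses I/O synchronously. */
int fw_submit_cmd(const struct framework *fw, int slot)
{
    if (slot < 0 || slot >= NSLOTS)
        return SDA_EFAULTED;
    return (fw->state[slot] == SLOT_FAULTED) ? SDA_EFAULTED : SDA_OK;
}
```

A caller would zero-initialize a struct framework, call
fw_report_fault(&fw, 1, "xfer-timeout") when slot 1 misbehaves, and from
then on fw_submit_cmd(&fw, 1) returns SDA_EFAULTED while other slots
keep working.  The point of the split is only that the immediate action
lives in the framework and everything else is deferred.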

Then it appears that I have to write some userland code to do something 
meaningful with those reports.  (Which really, is pretty much 100% just 
stuffing a message in a system log somewhere, because there is damned 
little else I can do -- the device will already have been removed from 
service by the SDcard framework.)

Frankly, the amount of complexity and utter pain and suffering 
surrounding figuring this out defies logic -- particularly for what is 
essentially a very simple and non-mission-critical subsystem like the 
SDcard slots found on laptops, which has basically trivial diagnosis 
needs (when a fault occurs, you remove and reinsert the card.  If the 
problem persists, you throw the memory card away and buy another one.  
Kind of a no-brainer.)

I've asked CTeam for a waiver for FMA.  I gather they're not very happy 
about this.  It would be a shame, however, IMO, to defer delivery of 
some basic and useful functionality while I attempt to obtain a 
doctorate in fault diagnosis and predictive self-healing in order to 
figure out exactly how to participate properly in FMA.

Just for the curious, here are some of the concepts that the poor 
schmuck like me who wants to do this has to figure out:

* FRUs
* ASRUs
* Labels?
* SERD
* libtopo
* fmad
* ereports
* FMRIs
* resource cache
* event transports...
* enumeration schemes?
* topology maps
* diagnosis engines
* agents
* Eversholt

(and somewhere after there my head exploded....)

Is there anyone here who understands all this, and can take an afternoon 
or two to sit and work through this with me?  Trying to glean the 
information I need from the 200+ page fmd PRM is, uhm, painful.

Alternatively, can I just punt on solving this particular batch of 
headaches until FMA support has been implemented for at least one other 
bus framework that is more similar to mine, and probably a lot more 
mission critical (such as USB)?

Feeling somewhat overwhelmed....

    -- Garrett

Chris Horne wrote:
> Hi Stephen
>
>   
>> Cynthia McGuire wrote:
>>     
>>> Garrett D'Amore wrote:
>>>
>>>       
>>>> Thanks for the good advice that folks have given me.  I still have a 
>>>> few more questions.
>>>>
>>>> 1) Many errors are likely to occur or be detected at hotplug time.  
>>>> That is, when the SDcard is first inserted.  Generally, this means 
>>>> that the given SDcard will not be initialized, and cannot be 
>>>> accessed.  What are the expectations for FMA here?  I clearly can't 
>>>> use the SDcard device itself in the topology, because it doesn't 
>>>> exist (although the slot does exist).  Apart from the fact that the 
>>>> administrator never got a chance to use the card in the first place, 
>>>> there really isn't any loss of service.  (The service was never 
>>>> delivered in the first place.)  Is FMA still the right answer here?  
>>>> (Note that a lot of the detectable errors might just be someone 
>>>> trying to use a card that is not supported by the slot, so it's not 
>>>> really a fault so much as just user-error.)
>>>>         
>>> I suppose these are not faults in the sense that there is broken 
>>> hardware but you may want to send an 'alert'.  An 'alert' is defined 
>>> in the phase I of the Sensor Abstraction project.  An alert event 
>>> doesn't indict something broken but rather alerts the admin to 
>>> something out of range.  In any case, I think you can model the 
>>> topology with slots containing cards much like we do for disk drives 
>>> and their bays.  The fault or alert event could then point to the slot 
>>> rather than a card that doesn't exist.
>>>
>>>       
>> Generally a driver's attach routine should be able to distinguish 
>> between failure due to an administrator error (e.g. unrecognised 
>> deviceid) and a real hardware fault (device hang, parity error on bus). 
>> I think the latter case is potentially quite likely - if there is a hard 
>> fault on the card it is quite likely to show up during attach. If your 
>> driver can detect genuine hardware faults during attach it should report 
>> them and they should be diagnosed to faults so that the appropriate 
>> service action can be raised.
>>
>> The other case Cindi mentions (raising an "alert" for cases where there 
>> is no hardware fault) is part of an upcoming project, so I guess that's 
>> more for the future.
>>
>> There is certainly a problem with devices that fail to attach not being 
>> in the topology.  I've been discussing this with Vikram to see if we can 
>> fix this (if the node got as far as the init state it may still be 
>> possible to detect it).
>>     
>
> I wanted to give some more perspective on attach-time telemetry
> from the perspective of storage FMA - since it is different.
>
> The libtopo topology for storage does not continue out from a
> pci-fn to the device. The DE front end code (eversholt) will
> be matching topology/config using devid in the topology instead
> of path through the topology (the path is in the ereport too,
> but the devid takes precedence).  The topology node with the devid
> is associated with 'disk' in an /enclosure/bay/disk/etc
> tree oriented structure. For non-supported topologies eversholt
> will silently discard the ereport (provided the new .esc ereport
> property discard_if_config_unknown is used).
>
> A side-effect of this is that the ability to match storage
> ereports to topology by devid has a looser relationship to the
> devinfo state model.  For path oriented match, DS_INITIALIZED
> always establishes the devi_addr prior to attach(9E). For
> devid oriented match, the dependency is on ddi_devid_register(9F)
> use - called by either target driver attach(9E), or by HBA during
> tran_tgt_init(9E) processing off initchild (for some transports),
> or (in rare cases) via sun-cluster ioctl.  The earlier the devid
> can be registered, the better.
>
> Storage ereports with a devid are only generated when we are
> successfully communicating with the device, and are positive
> of the device identity. This means that early-attach failures
> prior to devid registration look a bit like a transport ereport
> (they have no devid, and can't map to topology), and a bit
> like a device-as-detector ereport (they may have request sense
> ASC/ASCQ info, which only comes from talking to the device).
> The plan (not implemented, some hand-waving) is for the
> non-eversholt transport DE we are working on to subscribe to
> device-as-detector ereport classes (normally handled by
> eversholt). If a device-as-detector ereport comes in without
> a devid, it will be re-published with the devid last registered
> for the path (no devid -> discard).  For supported topologies,
> adding the devid will allow eversholt to process the ereport.
>
> -Chris
>
>   
>>>> 2) The most reasonable response to most of the errors that the SDcard 
>>>> framework can detect is simply to offline the failing card.  I don't 
>>>> think I want to wait until some userland agent does this -- I'd feel 
>>>> a lot better if the offline/retire action took place in the kernel, 
>>>> as quickly as possible.  (Mostly because I don't want the framework 
>>>> then trying to continue to access the failing device.)  
>>>>         
>>> The FMA does permit immediate error handling following detection of an 
>>> error when the system or user data may be compromised.  For example, a 
>>> hardened driver may want to discontinue using a particular device 
>>> instance after detecting a fatal error.  This is preferable in some 
>>> situations to a panic.  Post-diagnosis, agents can decide whether or 
>>> not the error handling action was correct.  For example, the diagnosis 
>>> software could determine that the wrong device was offlined and make a 
>>> correction.
>>>
>>>       
>> You probably ought to read Vikram's IO retire spec (PSARC 2007/290). He 
>> has a number of mechanisms for isolating a device such as "fencing", 
>> which maybe you could use?
>>
>> Steve
>>
>>
>>     
>>> The key thing is that you not embed complex diagnosis in your 
>>> framework or driver.  Try to separate what needs to happen immediately 
>>> and what can wait until diagnosis gives a clear picture of the problem.
>>>
>>>       
>>>> So, if the framework does this, what kind of topology should I 
>>>> report against?  The slot, or the card itself?
>>>>
>>>>         
>>> The thing that's broken, which sounds like it is the card.
>>>
>>>
>>>       
>>>> 3) That leads to the next course, which is how to handle recovery.  
>>>> My gut feeling is that the recovery action for errors should be:
>>>>
>>>>   a) the user removes and reinserts the card (or a different card)
>>>>   b) the user uses cfgadm -x reset-slot to reset the slot and the card
>>>>
>>>>         
>>> These sound like possible repair actions that you will describe in 
>>> your knowledge articles.
>>>
>>>
>>>       
>>>> Note that I don't think automated recovery action in fmad is 
>>>> necessarily a good idea.
>>>>
>>>>         
>>> That's fine, although you may need to disable the IO retire agent from 
>>> taking its default actions.
>>>
>>>
>>>       
>>>> 4) SDcard as a bus, doesn't have the notion of DMA or bus mapping.  
>>>> So access handle checking makes little sense to me.  But I'm 
>>>> imagining that the errors that can be detected (e.g. a 
>>>> protocol/signaling error) might need to be reported to child 
>>>> drivers.  But then again, the recovery action is generally to just 
>>>> report a synchronous failure to the child (e.g. SDA_EIO or 
>>>> somesuch).  If I've done that, do I also need to go thru the trouble 
>>>> of propagating these errors to child nodes?  (Generally the child 
>>>> node is going to be taken offline anyway, although it may refuse 
>>>> the associated ddi-detach, but if it continues to try to perform I/O, 
>>>> right now I wind up returning a generic SDA_EFAULTED error, 
>>>> indicating that the slot is in a faulted state and IO is not possible.)
>>>>         
>>> It depends if you want to permit the child instances to report any 
>>> errors of their own.  That's the purpose of the error reporting chain 
>>> in PCI and the DDI DMA routines.  Because errors and controllers cross 
>>> interface boundaries, providing an error reporting chain permits those 
>>> errors to be reported  before the device is taken offline.  I don't 
>>> really know enough about the technology to say which is the best 
>>> approach.
>>>
>>>
>>>       
>>>> 5) Of course, SD slot controllers are themselves on busses which have 
>>>> DMA and registers, so the parent slot driver will be checking access 
>>>> handles, detecting PCI bus errors, etc.  How, if at all, would these 
>>>> be reported to the child driver.  Again, the child driver has no 
>>>> access handles itself.  I'm kind of thinking that just returning 
>>>> errors synchronously (in response to commands), combined with 
>>>> an ereport posted upstream from the slot, is adequate.  But am I missing 
>>>> something?
>>>>         
>>> Passing error information in-band via the commands should work 
>>> just fine.
>>>
>>>
>>>       
>>>> Thoughts?  Am I making sense?  Am I understanding things clearly?
>>>>         
>>> Yes, it sounds like you're on the right track!
>>>
>>>
>>>       
>>>> Note that I think a lot of these similar issues would show up if FMA 
>>>> was ever applied to e.g. USB.
>>>>         
>>> Absolutely.
>>>
>>> Cindi
>>>
>>>       
>> _______________________________________________
>> fm-discuss mailing list
>> fm-discuss@opensolaris.org
>>     
>
>   

_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org
