Thanks for the good advice that folks have given me.  I still have a few 
more questions.

1) Many errors are likely to occur or be detected at hotplug time.  That 
is, when the SDcard is first inserted.  Generally, this means that the 
given SDcard will not be initialized, and cannot be accessed.  What are 
the expectations for FMA here?  I clearly can't use the SDcard device 
itself in the topology, because it doesn't exist (although the slot does 
exist).  Apart from the fact that the administrator never got a chance 
to use the card in the first place, there really isn't any loss of 
service.  (The service was never delivered in the first place.)  Is FMA 
still the right answer here?  (Note that a lot of the detectable errors 
might just be someone trying to use a card that is not supported by the 
slot, so it's not really a fault so much as just user error.)

2) The most reasonable response to most of the errors that the SDcard 
framework can detect is simply to offline the failing card.  I don't 
think I want to wait until some userland agent does this -- I'd feel a 
lot better if the offline/retire action took place in the kernel, as 
quickly as possible.  (Mostly because I don't want the framework then 
trying to continue to access the failing device.)  So, if the framework 
does this, what kind of topology should I report against?  The slot, or 
the card itself?

3) That leads to the next question, which is how to handle recovery.  My 
gut feeling is that the recovery action for errors should be:

    a) the user removes and reinserts the card (or a different card), or
    b) the user uses cfgadm -x reset-slot to reset the slot and the card.

Note that I don't think automated recovery action in fmad is necessarily 
a good idea.

4) SDcard, as a bus, doesn't have the notion of DMA or bus mapping.  So 
access handle checking makes little sense to me.  But I'm imagining that 
the errors that can be detected (e.g. a protocol/signaling error) might 
need to be reported to child drivers.  But then again, the recovery 
action is generally to just report a synchronous failure to the child 
(e.g. SDA_EIO or somesuch).  If I've done that, do I also need to go 
through the trouble of propagating these errors to child nodes?  (Generally 
the child node is going to be taken offline anyway, although it may 
refuse the associated ddi-detach.  If it continues to try to 
perform I/O, right now I wind up returning a generic SDA_EFAULTED error, 
indicating that the slot is in a faulted state and I/O is not possible.)
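
To make the behavior in (2) through (4) concrete, here is a rough userland 
model of what I have in mind.  It is only a sketch, not the framework code: 
sim_slot_t, the sim_slot_* functions, SDA_OK, the numeric values of the 
error codes, and the ereport counter are all made up for illustration.  
Only the names SDA_EIO and SDA_EFAULTED come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * SDA_EIO and SDA_EFAULTED are named in the text; the values below and
 * SDA_OK are placeholders, not the framework's real definitions.
 */
typedef enum {
    SDA_OK = 0,
    SDA_EIO = 1,
    SDA_EFAULTED = 2
} sda_err_t;

/* Hypothetical per-slot state kept by the framework. */
typedef struct {
    bool card_present;
    bool faulted;
    unsigned ereports_posted;   /* stand-in for posting an ereport */
} sim_slot_t;

/* Recovery (a): inserting a card starts from a clean, unfaulted slot. */
void
sim_slot_insert(sim_slot_t *slot)
{
    assert(slot != NULL);
    slot->card_present = true;
    slot->faulted = false;
}

/* Recovery (b): models "cfgadm -x reset-slot" clearing the fault. */
void
sim_slot_reset(sim_slot_t *slot)
{
    assert(slot != NULL);
    slot->faulted = false;
}

/*
 * One I/O request from a child.  hw_error simulates a detected
 * protocol/signaling error on this request.
 */
sda_err_t
sim_slot_io(sim_slot_t *slot, bool hw_error)
{
    assert(slot != NULL);
    if (!slot->card_present)
        return (SDA_EIO);       /* assumption: no card is a plain I/O error */
    if (slot->faulted)
        return (SDA_EFAULTED);  /* slot already faulted, refuse further I/O */
    if (hw_error) {
        /* Fault the slot right away, without waiting for userland. */
        slot->faulted = true;
        slot->ereports_posted++;
        return (SDA_EIO);       /* synchronous failure to the child */
    }
    return (SDA_OK);
}
```

In this model the request that hits the error fails with SDA_EIO, every 
later request gets SDA_EFAULTED, and I/O only succeeds again after a reset 
or a reinsertion.  Nothing is propagated to the child other than the 
synchronous return value.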

5) Of course, SD slot controllers are themselves on busses which have 
DMA and registers, so the parent slot driver will be checking access 
handles, detecting PCI bus errors, etc.  How, if at all, would these be 
reported to the child driver?  Again, the child driver has no access 
handles itself.  I'm kind of thinking that just returning errors 
synchronously (in response to commands), combined with an ereport posted 
upstream from the slot, is adequate.  But am I missing something?

Thoughts?  Am I making sense?  Am I understanding things clearly?

Note that I think a lot of similar issues would show up if FMA were 
ever applied to e.g. USB.

Thanks again!

    -- Garrett
_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org
