Garrett D'Amore wrote:
> Thanks for the good advice that folks have given me.  I still have a few 
> more questions.
> 
> 1) Many errors are likely to occur or be detected at hotplug time.  That 
> is, when the SDcard is first inserted.  Generally, this means that the 
> given SDcard will not be initialized, and cannot be accessed.  What are 
> the expectations for FMA here?  I clearly can't use the SDcard device 
> itself in the topology, because it doesn't exist (although the slot does 
> exist).  Apart from the fact that the administrator never got a chance 
> to use the card in the first place, there really isn't any loss of 
> service.  (The service was never delivered in the first place.)  Is FMA 
> still the right answer here?  (Note that a lot of the detectable errors 
> might just be someone trying to use a card that is not supported by the 
> slot, so it's not really a fault so much as just user-error.)

I suppose these are not faults in the sense that there is broken 
hardware, but you may want to send an 'alert'.  An 'alert' is defined in 
phase I of the Sensor Abstraction project.  An alert event doesn't 
indict something broken but rather alerts the admin to something out of 
range.  In any case, I think you can model the topology with slots 
containing cards much like we do for disk drives and their bays.  The 
fault or alert event could then point to the slot rather than a card 
that doesn't exist.
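
As a rough illustration of pointing the event at the slot when no card 
exists, here is a small user-space C sketch.  It is not FMA topology 
code: sd_slot_t, event_target(), and the "slot=N/card=0" path format 
are hypothetical names made up for this example and are not real FMRI 
syntax.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical view of one SD slot as the framework might track it. */
typedef struct sd_slot {
	int  slot_num;
	bool card_present;	/* true once a card has been initialized */
} sd_slot_t;

/*
 * Pick the resource an alert or fault event should name.  If the card
 * never initialized, it does not exist in the topology, so the event
 * names the slot instead.  The path format is invented for this sketch.
 */
static const char *
event_target(const sd_slot_t *slot, char *buf, size_t len)
{
	assert(slot != NULL && buf != NULL);
	if (slot->card_present)
		(void) snprintf(buf, len, "slot=%d/card=0", slot->slot_num);
	else
		(void) snprintf(buf, len, "slot=%d", slot->slot_num);
	return (buf);
}
```

With an empty or uninitialized slot the event names only the slot; once 
a card is present the same call names the card inside that slot, much 
like a disk inside its bay.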

> 
> 2) The most reasonable response to most of the errors that the SDcard 
> framework can detect is simply to offline the failing card.  I don't 
> think I want to wait until some userland agent does this -- I'd feel a 
> lot better if the offline/retire action took place in the kernel, as 
> quickly as possible.  (Mostly because I don't want the framework then 
> trying to continue to access the failing device.)  

FMA does permit immediate error handling following detection of an 
error when the system or user data may be compromised.  For example, a 
hardened driver may want to discontinue using a particular device 
instance after detecting a fatal error.  This is preferable in some 
situations to a panic.  Post-diagnosis, agents can decide whether or not 
the error handling action was correct.  For example, the diagnosis 
software could determine that the wrong device was offlined and make a 
correction.

The key thing is that you not embed complex diagnosis in your framework 
or driver.  Try to separate what needs to happen immediately and what 
can wait until diagnosis gives a clear picture of the problem.
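
To make that split concrete, here is a minimal user-space C sketch.  It 
is not Solaris kernel code: slot_t, slot_fatal_error(), slot_submit(), 
slot_clear_fault(), and the numeric values of the error codes are 
hypothetical.  Only the names SDA_EIO and SDA_EFAULTED come from this 
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Error code names are from the thread; the values are assumed. */
enum { SDA_OK = 0, SDA_EIO = 1, SDA_EFAULTED = 2 };

/* Hypothetical per-slot state kept by the framework. */
typedef struct slot {
	bool faulted;		/* set as soon as a fatal error is seen */
	int  pending_reports;	/* events queued for later diagnosis */
} slot_t;

/*
 * Immediate action: stop using the device and queue an event.
 * No diagnosis happens here; deciding what is actually broken is
 * left to whatever consumes the queued reports later.
 */
static void
slot_fatal_error(slot_t *slot)
{
	assert(slot != NULL);
	slot->faulted = true;
	slot->pending_reports++;
}

/* I/O entry point: refuse further access once the slot is faulted. */
static int
slot_submit(slot_t *slot)
{
	assert(slot != NULL);
	if (slot->faulted)
		return (SDA_EFAULTED);
	return (SDA_OK);
}

/* Post-diagnosis correction, e.g. the wrong device was offlined. */
static void
slot_clear_fault(slot_t *slot)
{
	assert(slot != NULL);
	slot->faulted = false;
}
```

The point of the sketch is that the fast path only flips a flag and 
records an event.  Anything that needs judgment, such as reversing the 
offline, is a separate entry point driven by diagnosis.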

> So, if the framework
> does this, what kind of topology should I report against?  The slot, or 
> the card itself?
> 

Report against the thing that's broken, which sounds like it is the card.

> 3) That leads to the next course, which is how to handle recovery.  My 
> gut feeling is that the recovery action for errors should be:
> 
>    a) the user removes and reinserts the card (or a different card)
>    b) the user uses cfgadm -x reset-slot to reset the slot and the card
> 

These sound like possible repair actions that you will describe in 
your knowledge articles.

> Note that I don't think automated recovery action in fmad is necessarily 
> a good idea.
> 

That's fine, although you may need to disable the IO retire agent from 
taking its default actions.

> 4) SDcard as a bus, doesn't have the notion of DMA or bus mapping.  So 
> access handle checking makes little sense to me.  But I'm imagining that 
> the errors that can be detected (e.g. a protocol/signaling error) might 
> need to be reported to child drivers.  But then again, the recovery 
> action is generally to just report a synchronous failure to the child 
> (e.g. SDA_EIO or somesuch).  If I've done that, do I also need to go 
> thru the trouble of propagating these errors to child nodes?  (Generally 
> the child node is going to be taken offline anyway, although it may 
> refuse the associated ddi-detach, but if it continues to try to 
> perform I/O, right now I wind up returning a generic SDA_EFAULTED error, 
> indicating that the slot is in a faulted state and IO is not possible.)

It depends on whether you want to permit the child instances to report 
any errors of their own.  That's the purpose of the error reporting 
chain in PCI and the DDI DMA routines.  Because errors and controllers 
cross interface boundaries, providing an error reporting chain permits 
those errors to be reported before the device is taken offline.  I don't 
really know enough about the technology to say which is the best approach.

> 
> 5) Of course, SD slot controllers are themselves on busses which have 
> DMA and registers, so the parent slot driver will be checking access 
> handles, detecting PCI bus errors, etc.  How, if at all, would these be 
> reported to the child driver.  Again, the child driver has no access 
> handles itself.  I'm kind of thinking that just returning errors 
> synchronously (in response to commands), combined with an ereport posted 
> upstream from the slot, is adequate.  But am I missing something?

Passing error information in-band via the command path should work just 
fine.
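
For illustration, in-band reporting could look roughly like the 
following user-space C sketch.  Everything in it is hypothetical 
(sd_cmd_t, parent_complete(), child_done(), the upstream_reports 
counter standing in for posted ereports, and the error code values).  
It is not the SD framework's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* SDA_EIO is named in the thread; the values are assumed. */
enum { SDA_OK = 0, SDA_EIO = 1 };

/* Hypothetical command passed from the child driver to the slot driver. */
typedef struct sd_cmd {
	int  status;			/* set by the parent before completion */
	void (*done)(struct sd_cmd *);	/* child's completion callback */
} sd_cmd_t;

/* Stand-in for ereports the slot driver posts upstream. */
static int upstream_reports;

/* Last status seen by the example child, for demonstration only. */
static int child_last_status = -1;

/* Example child callback: the child only ever sees the command status. */
static void
child_done(sd_cmd_t *cmd)
{
	child_last_status = cmd->status;
}

/*
 * Parent completion path.  When the parent detected a bus error while
 * servicing the command, it records a report upstream and fails the
 * command in-band; the child needs no access handles of its own.
 */
static void
parent_complete(sd_cmd_t *cmd, int bus_error)
{
	assert(cmd != NULL && cmd->done != NULL);
	if (bus_error) {
		upstream_reports++;
		cmd->status = SDA_EIO;
	} else {
		cmd->status = SDA_OK;
	}
	cmd->done(cmd);
}
```

The child learns about the failure only through the command status, 
while the detail of what went wrong on the parent bus travels upstream 
separately.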

> 
> Thoughts?  Am I making sense?  Am I understanding things clearly?

Yes, it sounds like you're on the right track!

> 
> Note that I think a lot of these similar issues would show up if FMA was 
> ever applied to e.g. USB.

Absolutely.

Cindi

_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org
