[clearview-discuss] FMA/networking post-UV

Mike Shapiro Tue, 19 Feb 2008 23:28:50 -0800

On Wed, Feb 20, 2008 at 12:35:06AM -0500, Peter Memishian wrote:
> 
> Mike/Cindi,
> 
> As you may recall, one of the problems with "style 2" DLPI datalinks is
> that opening them bypasses the FMA I/O retire checks in spec_open() (since
> the kernel doesn't know what piece of hardware is actually being accessed
> until the DL_ATTACH_REQ is done).
> 
> However, the /dev/net directory introduced by the recent Clearview UV
> putback consists only of "style 1" DLPI links, which the spec_open()
> checks correctly catch, causing ENXIO to be returned.  Since all libdlpi
> applications check /dev/net first, these style-1 links are now preferred.
> From a RAS standpoint, this is a marked improvement.  However, we've
> already encountered a handful of systems with a network device that
> apparently mostly worked (even though FMA had retired it) which failed to
> open with ENXIO after upgrading to the UV bits.  Of course, once the user
> runs "fmadm faulty", everything falls into place -- but to most, the
> connection between the ENXIO error and FMA may not occur (especially since
> FMA may have done the retire months ago).  I fear this will lead to
> support calls and frustration.
> 
> As such, I had a few points I wanted your input on:
> 
>       1. Has there been any discussion of a new errno for this case?
>          If we had a new errno, such as ERETIRED or EFAULTED, API
>          consumers could differentiate this case if appropriate, and
>          moreover strerror() could say something more helpful than "No
>          such device or address".


Not generally, because of the long-standing murky question of whether
adding new Solaris-specific errno values is something we do.
Personally, I have no issue with it.  Another approach would be
to have dladm deal with it, i.e. if it gets ENXIO, make additional
calls to determine whether the device is faulty as opposed to
unattached.  ENXIO is already overloaded: e.g. driver failed to
attach versus nothing actually there.
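
A minimal sketch of the "have the tool deal with it" approach, assuming
some way exists to ask whether FMA has retired the device behind a link.
The retired_check_fn callback, the link_open_errmsg() name, and the two
stand-in callbacks are hypothetical and are not existing dladm or libdlpi
interfaces; only errno, ENXIO, and strerror() are real.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical callback: returns true if FMA has retired the device
 * behind the named link.  How this would really be answered is an
 * assumption left open here.
 */
typedef bool (*retired_check_fn)(const char *link);

/* Stand-in callbacks for illustration only. */
bool
always_retired(const char *link)
{
	(void) link;
	return (true);
}

bool
never_retired(const char *link)
{
	(void) link;
	return (false);
}

/*
 * Map the errno from a failed link open to a user-facing message.
 * Only ENXIO triggers the extra check, since that is the overloaded
 * value; every other errno falls through to strerror().
 */
const char *
link_open_errmsg(int err, const char *link, retired_check_fn is_retired)
{
	if (err == ENXIO && is_retired != NULL && is_retired(link))
		return ("device retired by FMA; run \"fmadm faulty\"");
	return (strerror(err));
}
```

A caller would pass the errno saved from the failed open along with the
link name.  The ENXIO value itself is unchanged, so existing consumers
that test for it keep working.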

>       2. It seems uneven to have retired networking hardware but not
>          have anything reported by dladm -- minimally, I'd think it
>          appropriate for show-phys to report this, and (given the
>          severity) maybe show-link as well.  (However, I don't want
>          dladm to impinge on fmadm's duties.)

It's always a good thing for participating subsystems to report
enriched fault status for their resources, since such reporting
can be more specific and more useful than the generic FMA view.
The key is to make it connect to the FMA output (e.g. msgid
values).  Examples of this today include svcs -x and
zpool status -x.  Making dladm do the same would be a good thing.
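
As a rough illustration of tying subsystem status back to FMA output,
a status line could carry the msgid alongside the link name.  The
format_link_fault() function, the output wording, and the msgid in the
usage note are all hypothetical; this is not actual or proposed dladm
output.

```c
#include <stdio.h>

/*
 * Format a one-line status for a link.  A NULL msgid means no fault is
 * known; otherwise the msgid is included so the user can connect the
 * line to what "fmadm faulty" reports.  Returns what snprintf() returns.
 */
int
format_link_fault(char *buf, size_t len, const char *link, const char *msgid)
{
	if (msgid == NULL)
		return (snprintf(buf, len, "%s: ok", link));
	return (snprintf(buf, len,
	    "%s: retired (FMA msgid %s; see \"fmadm faulty\")", link, msgid));
}
```

For example, calling it with link "net0" and the placeholder msgid
"EXAMPLE-0000-00" produces a single line naming both, which gives the
user something to look up.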
 
>       3. It worries me that in all the cases we've seen thus far, the
>          fault was "repaired" and never seen again.  Is this common, or
>          is this indicative of bugs in our fault detection code?

Do you have an example?  i.e. what fault was diagnosed?

-Mike

-- 
Mike Shapiro, Solaris Kernel Development. blogs.sun.com/mws/
