[clearview-discuss] FMA/networking post-UV

Peter Memishian Wed, 20 Feb 2008 03:02:08 -0500

 > > As such, I had a few points I wanted your input on:
 > > 
 > >    1. Has there been any discussion of a new errno for this case?
 > >       If we had a new errno, such as ERETIRED or EFAULTED, API
 > >       consumers could differentiate this case if appropriate, and
 > >       moreover strerror() could say something more helpful than "No
 > >       such device or address".
 > 
 > Not generally, due to the long-standing murky issue of whether
 > adding new Solaris-specific errno values is something we do.


Looking at errno.h, I see some recent additions -- e.g. extended
accounting added ENOTACTIVE, and a number of error codes were added for
robust mutexes.  So there seems to be precedent.  (I understand that we
are stuck with at most 151 error codes from now until another major
release, and thus need to be cautious -- but if push came to shove
someday, I'd hope we could recycle ENOANO and other useless junk ;-)

 > But personally I have no issue with that.

Cool.  To be clear, we're not in a position to make these changes right
now, so there's plenty of time to discuss this.

 > Another approach would be just to have dladm deal with it -- i.e. if it
 > gets ENXIO then make additional calls to realize something is faulty as
 > opposed to unattached.  It is already the case that ENXIO is
 > overloaded: e.g. driver failed to attach versus nothing actually there.

I agree that's possible, though of course it doesn't improve things for
commands that will just do a strerror(errno) after the failed open().
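
To make the trade-off concrete, here is a rough sketch of the "have the
consumer deal with it" approach.  The probe callback and the stub are
purely hypothetical stand-ins (no such interface is named in this thread),
and the wording of the refined message is an assumption as well:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical probe: returns nonzero if the device behind `path` is
 * known to be faulted or retired.  This is an assumption made for
 * illustration; a real consumer would need some actual way to ask.
 */
typedef int (*fault_probe_t)(const char *path);

/* Stand-in probe for demonstration only: claims everything is faulted. */
static int
stub_probe_faulted(const char *path)
{
	(void) path;
	return (1);
}

/*
 * Build a message for a failed open().  For ENXIO, ask the probe whether
 * the device is faulted and say so; otherwise fall back to strerror().
 * Returns 1 if the probe refined the message, 0 if strerror() was used.
 */
int
explain_open_error(int err, const char *path, fault_probe_t probe,
    char *buf, size_t buflen)
{
	if (err == ENXIO && probe != NULL && probe(path) != 0) {
		(void) snprintf(buf, buflen,
		    "%s: device is faulted or retired", path);
		return (1);
	}
	(void) snprintf(buf, buflen, "%s: %s", path, strerror(err));
	return (0);
}
```

A caller would pass the errno saved right after the failed open(), e.g.
explain_open_error(errno, path, stub_probe_faulted, buf, sizeof (buf)).
Any command that skips this step and calls strerror(errno) directly still
gets the generic ENXIO text, which is the gap a new errno would close.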

 > >    2. It seems uneven to have retired networking hardware but not
 > >       have anything reported by dladm -- minimally, I'd think it
 > >       appropriate for show-phys to report this, and (given the
 > >       severity) maybe show-link as well.  (However, I don't want
 > >       dladm to impinge on fmadm's duties.)
 > 
 > It's always a good thing for participating subsystems to report
 > enriched fault status for their resources, since subsystem-specific
 > reporting can be more useful than the generic FMA view.  The key is
 > to make it connect to the FMA output (e.g. msgid values).  Examples
 > of this today include svcs -x and zpool status -x.  Making dladm do
 > the same would be a good thing.

Great.

 > >    3. It worries me that in all the cases we've seen thus far, the
 > >       fault was "repaired" and never seen again.  Is this common, or
 > >       is this indicative of bugs in our fault detection code?
 > 
 > Do you have an example?  i.e. what fault was diagnosed?

For instance, per 6664330, there was a PCI express fault
[http://sun.com/msg/PCIEX-8000-0A] in early December, but it went
unnoticed and the networking device continued to be used without incident
(until the eventual upgrade to build 83).

-- 
meem
