Peter Memishian wrote:

> 
>  > >  2. It seems uneven to have retired networking hardware but not
>  > >     have anything reported by dladm -- minimally, I'd think it
>  > >     appropriate for show-phys to report this, and (given the
>  > >     severity) maybe show-link as well.  (However, I don't want
>  > >     dladm to impinge on fmadm's duties.)
>  > 
>  > It's always a good thing for participating subsystems to report
>  > enriched fault status for their resources, since by definition such
>  > reporting can always be somewhat more useful and better than the
>  > generic FMA view.  The key is to make it connect to the FMA output
>  > (e.g. msgid values).  Examples of this today include svcs -x
>  > and zpool status -x.  Making dladm do same would be a good thing.
> 
> Great. 

One thing to be mindful of is that zpool and svcs both hard-code the 
msgids and track fault status independently of fmd.  ZFS is a bit 
better in that there is a ZFS agent that connects zpool status with 
what fmd is reporting and doing (i.e., retire or repair).

In dladm, you will probably have to use libtopo to convert interface 
names to an FMRI and then use some fmd project-private interfaces to 
get fault information on the interfaces in question.
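
To make that concrete, here is a rough sketch of the libtopo half: 
walk an hc-scheme snapshot, find the node whose link-name property 
matches the interface, and render that node's resource FMRI as a 
string.  The property group ("network-interface") and property 
("link-name") are invented for illustration, since topo nodes do not 
carry link names today; the fmd side (querying fault status via 
project-private interfaces) is omitted.

#include <fm/libtopo.h>
#include <libnvpair.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical property group/name under which a topo node for a
 * network interface would carry its link name; no such standard
 * property exists today.
 */
#define NIC_PGROUP      "network-interface"
#define NIC_PROP_LINK   "link-name"

struct lookup {
        const char *l_link;     /* link name we are searching for */
        char *l_fmri;           /* FMRI string result, if found */
};

static int
find_link_cb(topo_hdl_t *thp, tnode_t *node, void *arg)
{
        struct lookup *lp = arg;
        nvlist_t *fmri;
        char *link, *str;
        int err;

        /* Skip nodes lacking the (hypothetical) link-name property. */
        if (topo_prop_get_string(node, NIC_PGROUP, NIC_PROP_LINK,
            &link, &err) != 0)
                return (TOPO_WALK_NEXT);

        if (strcmp(link, lp->l_link) != 0) {
                topo_hdl_strfree(thp, link);
                return (TOPO_WALK_NEXT);
        }
        topo_hdl_strfree(thp, link);

        /* Found our node: render its resource FMRI as a string. */
        if (topo_node_resource(node, &fmri, &err) != 0)
                return (TOPO_WALK_ERR);
        if (topo_fmri_nvl2str(thp, fmri, &str, &err) == 0)
                lp->l_fmri = str;       /* caller frees */
        nvlist_free(fmri);

        return (TOPO_WALK_TERMINATE);
}

int
main(int argc, char **argv)
{
        topo_hdl_t *thp;
        topo_walk_t *twp;
        struct lookup l = { NULL, NULL };
        char *uuid;
        int err, rc = 1;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s <linkname>\n", argv[0]);
                return (2);
        }
        l.l_link = argv[1];

        if ((thp = topo_open(TOPO_VERSION, NULL, &err)) == NULL)
                return (1);
        if ((uuid = topo_snap_hold(thp, NULL, &err)) == NULL) {
                topo_close(thp);
                return (1);
        }
        topo_hdl_strfree(thp, uuid);

        /* Depth-first walk of the hc-scheme topology. */
        if ((twp = topo_walk_init(thp, "hc", find_link_cb,
            &l, &err)) != NULL) {
                (void) topo_walk_step(twp, TOPO_WALK_CHILD);
                topo_walk_fini(twp);
        }

        if (l.l_fmri != NULL) {
                (void) printf("%s -> %s\n", l.l_link, l.l_fmri);
                topo_hdl_strfree(thp, l.l_fmri);
                rc = 0;
        }

        topo_snap_release(thp);
        topo_close(thp);
        return (rc);
}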

Ideally, what we want is a single entry point for software like dladm 
to query fault and FMRI information via libtopo.  That is, given an 
interface name (just a property on a topo FMRI node), return the fault 
status.  This should be a very easy extension to libtopo, plus some 
additional properties on topo FMRI nodes for network interfaces.
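
For concreteness, such an entry point might look like the declarations 
below.  The names and signatures are purely illustrative, patterned on 
the existing topo_fmri_*() routines; nothing like them exists in 
libtopo today.

#include <fm/libtopo.h>
#include <libnvpair.h>

/*
 * Hypothetical: map a link name (a property on a network-interface
 * topo node) to the node's resource FMRI.
 */
extern int topo_fmri_from_linkname(topo_hdl_t *thp, const char *link,
    nvlist_t **fmrip, int *errp);

/*
 * Hypothetical: return the fault status of the resource named by the
 * FMRI, e.g. as a bitmask of faulty/unusable-style flags, so dladm
 * never has to call fmd project-private interfaces directly.
 */
extern int topo_fmri_fault_status(topo_hdl_t *thp, nvlist_t *fmri,
    uint32_t *statusp, int *errp);

With those two calls, dladm show-phys could annotate each faulted link 
with its FMA msgid, the same way svcs -x and zpool status -x tie back 
to the FMA output.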

If you're interested in pursuing this, let Rob Johnston know.

> 
>  > >  3. It worries me that in all the cases we've seen thus far, the
>  > >     fault was "repaired" and never seen again.  Is this common, or
>  > >     is this indicative of bugs in our fault detection code?
>  > 
>  > Do you have an example?  i.e. what fault was diagnosed?
> 
> For instance, per 6664330, there was a PCI express fault
> [http://sun.com/msg/PCIEX-8000-0A] in early December, but it went
> unnoticed and the networking device continued to be used without incident
> (until the eventual upgrade to build 83).
> 

I believe there are a number of instances in PCI and the hardened PCI 
leaf drivers where hard faults are diagnosed for conditions that may 
not immediately or subsequently impact the device or operating system. 
In some instances, you may never see the condition again.  Please read 
the details section of the diagnosis article (http://www.sun.com/msg). 
A number of PCI articles point out that some of these faults can be 
caused by a poorly seated controller and recommend unplugging it and 
plugging it back in.

But if you suspect that a device is being diagnosed as faulty 
incorrectly or too soon, please notify the driver developer or open a 
bug.  The last thing we want is a bunch of NTF (no-trouble-found) 
controllers getting returned to their respective manufacturers.
