Chris Worley wrote:
On Wed, Jul 8, 2009 at 1:10 PM, Erwin Tsaur<erwin.ts...@sun.com> wrote:
Chris Worley wrote:
(Sorry for the misleading "Subject" in the initial post.  I would like
to know a more appropriate place to post, since fm is just the
messenger here.)

More to add: fmadm faulty may be saying something about a bad PCIe
slot or device (is there an "lspci" in OpenSolaris?):

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c  PCIEX-8000-KP  Major

Fault class : fault.io.pciex.device-interr-corr max 29%
             fault.io.pciex.bus-linkerr-corr max 14%
Affects     : dev:////p...@0,0/pci8086,3...@1/pci15d9,1...@0
             dev:////p...@0,0/pci8086,3...@1/pci15d9,1...@0,1
             dev:////p...@0,0/pci8086,3...@1
                 faulted but still in service
FRU         : "MB" (hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
                 faulty

Description : Too many recovered bus errors have been detected, which indicates
             a problem with the specified bus or with the specified
             transmitting device. This may degrade into an unrecoverable
             fault.
             Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
             this fault

Action      : If a plug-in card is involved check for badly-seated cards or
             bent pins. Otherwise schedule a repair procedure to replace the
             affected device.  Use fmadm faulty to identify the device or
             contact Sun for support.

How bad is this error?  I need to put some adapters in, but it sounds
like the OS doesn't handle the NHM's IOH (or is it really detecting a
HW issue?).

The OS does handle these issues, and unfortunately this is a HW issue.  It is
likely to eventually cause your system to panic or fill up your hard drive,
assuming you are seeing a lot of btlp and rto errors.  If nothing else, these
errors are a performance killer.  Not only do the RTO/BTLP errors tell you
that many packets require retransmission, the OS also has to constantly go out
and scan and clean up the fabric.

This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.

The errors in OpenSolaris occur if no cards are installed in the bus.

The other OSes don't report any errors w/ or w/o cards in the bus.

This doesn't happen when there are no cards installed, since the error is literally complaining about packets received between 2 devices. Are you sure you are correctly identifying the right slot?

I believe only OpenSolaris even detects these errors, which is why the other OSes don't report any. That doesn't mean the errors aren't occurring, though.

It would also be nice to throttle the errlog so it doesn't fill the
disk an hour after boot.  Is this possible?

No throttling is possible, but you could turn it off.  That is strongly
discouraged, though; it's better to fix the issue.  It really could just be a
badly seated card.

How do I disable the errors?

We need to figure out exactly what your error is first. Please provide the "fmdump -eV" log. If it is huge, the last 500-1000 lines should be enough.
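
For anyone following along, here is a minimal sketch of trimming a large log to its last lines before posting it. It assumes the output of "fmdump -eV" has already been redirected to a text file; the file name in the usage note and the helper name last_lines are hypothetical, not part of any FMA tool.

```python
from collections import deque


def last_lines(lines, limit=1000):
    """Return the final `limit` lines from an iterable of lines.

    A bounded deque drops older lines as newer ones arrive, so a
    multi-gigabyte log never has to be held in memory all at once.
    """
    return list(deque(lines, maxlen=limit))
```

One way to use it, assuming the log was saved with "fmdump -eV > /tmp/fmdump-eV.txt": open that file, pass the file object to last_lines(f, 1000), and write the returned lines to a smaller file to attach to the reply.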
Thanks,

Chris
On Wed, Jul 8, 2009 at 12:40 PM, Chris Worley<worl...@gmail.com> wrote:

Please tell me if this is the wrong group to post to (including a
better group to post to)...

 I just upgraded a dual socket 3.1GHz NHM-based system:

http://supermicro.com/products/motherboard/QPI/5500/X8DTH-6F.cfm

...in order to get the latest igb driver to recognize the NIC.

The upgrade worked for that, but on boot, the cylon-stare
"OpenSolaris" splash screen doesn't go away w/o hitting "escape", and
I get a message "svc.startd: system/xvm/ipagent: default failed
repeatedly" and  "...failed to abandon contract 66: permission denied"
in the console.

"svcs -xv" returns nothing.

/var/fm/fmd/errlog is growing out of control, and "fmdump -e" is
spewing hundreds of messages per second, like:

Jul 08 11:17:04.3593 ereport.io.pciex.dl.btlp
Jul 08 11:17:05.0165 ereport.io.pci.fabric
Jul 08 11:17:04.3595 ereport.io.pciex.dl.rto
Jul 08 11:17:04.3595 ereport.io.pciex.rc.ce-msg
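
With hundreds of lines per second, a per-class tally is easier to read than the raw stream. The following is a rough sketch, assuming each line of saved "fmdump -e" output looks like the samples above (month, day, time, then the ereport class); the function name count_ereport_classes is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "Jul 08 11:17:04.3593 ereport.io.pciex.dl.btlp".
# The layout is assumed from the sample lines quoted in this thread.
EREPORT_LINE = re.compile(
    r"^\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d+\s+(ereport\.\S+)$"
)


def count_ereport_classes(lines):
    """Count how often each ereport class appears in an iterable of lines.

    Lines that do not match the assumed layout (for example the
    "TIME CLASS" header) are skipped.
    """
    counts = Counter()
    for line in lines:
        match = EREPORT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file of saved output to count_ereport_classes and calling most_common() on the result lists the noisiest classes first, which should show whether btlp and rto dominate.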

Fmdump doesn't give me much more to work with:

# fmdump  ;fmdump  -eVu 016cf20c-d572-42c1-f217-9eb8d439b73c
TIME                 UUID                                 SUNW-MSG-ID
Jul 07 07:55:42.6832 016cf20c-d572-42c1-f217-9eb8d439b73c PCIEX-8000-KP
TIME                           CLASS

/var/adm/messages doesn't show any errors.

I had other issues w/ the MGA driver. It worked before the upgrade,
but not after.  Deleting the driver falls back to the vesa driver, which
works.  I don't know if that's salient to this issue, but thought I'd
make sure to relay it.

Can anybody tell me what's wrong, how to fix it, or how I should
investigate further?

Thanks,

Chris


_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org

