-----Original Message-----
From: linux-poweredge-bounces-Lists On Behalf Of Mark Watts
Sent: Wednesday, October 20, 2010 9:38 AM
To: linux-poweredge-Lists
Subject: PowerEdge 1950 PCI Errors

I have a PCI-X Intel PRO/1000 MT Quad Port Server Adapter in Slot 2 on a PowerEdge 1950. OS is CentOS 5.4.

Shortly after enabling one of the ports for use on a 100Mbit network, NFS data transfer across that link stalled. All traffic through this interface seems to have ceased; even pings to machines that were previously pingable are timing out.

The following log entries are observed through OMSA:

  Status: OK          Wed Oct 20 12:01:51 2010
    Err Reg Pointer: Link Tuning sensor, OEM Diagnostic data event was asserted

  Status: Critical    Wed Oct 20 12:01:51 2010
    PCIE Fatal Err: Critical Event sensor, bus fatal error (Bus 0 Device 2 Function 0) was asserted

  Status: Critical    Wed Oct 20 12:01:51 2010
    PCI Parity Err: Critical Event sensor, PCI PERR (Slot 2) was asserted

Similarly, the following errors are seen in dmesg/syslog:

  Uhhuh. NMI received for unknown reason 30 on CPU 0.
  Do you have a strange power saving mode enabled?
  Dazed and confused, but trying to continue
  EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x20 (Non-Aliased Uncorrectable Non-Mirrored Demand Data ECC))
  EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x100 (Non-Aliased Uncorrectable Patrol Data ECC))

Can anyone enlighten me as to what's happened here? Do I have bad RAM, a bad Quad-Card, both or neither?

Cheers,

Mark.

-- 
Mark Watts BSc RHCE MBCS
Senior Systems Engineer, IPR Secure Managed Hosting
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions
GPG Key: http://www.linux-corner.info/mwatts.gpg

_______________________________________________
Linux-PowerEdge mailing list
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq

Hi Mark,

Try reseating the PCI NIC or removing it for testing purposes. Make sure your firmware is up to date.

The EDAC error message is not related to the PCIE Fatal Err message. EDAC should be disabled. The blacklist on RHEL5 should be under /etc/modprobe.d/. Alternatively, you may add the following to /etc/modprobe.conf:

  alias i5000_edac /dev/null
  alias edac_mc /dev/null
  options edac_mc panic_on_ue=0

For a system already running with the edac module loaded:

- run 'lsmod | grep -i edac'; it should return 'i5000_edac' and 'edac_mc'
- run 'modprobe -r <modules>', where <modules> are the edac modules listed by the lsmod command
- once the modules have been removed from the kernel, edac is disabled (for this boot)

EDAC is a kernel-level driver: it talks directly to the chipset, reads registers, and then just dumps out the raw register values. When it accesses these read-once registers they are cleared, so no information will be collected or logged by the Dell ESM. Without this information being obtained by the Dell ESM, there will never be any [LCD -- hardware level] alerts if a warning or failure threshold is reached. Also, there are no 'screens' available that will clearly identify the component logged by EDAC, whereas Dell ESM already has the ability to log and identify a "problematic" component.

Additionally, EDAC is primarily an ECC memory reporting module, so things like fans, temperatures, voltages, ... will not necessarily be caught and properly reported by EDAC (although it does report on some PCI bus parity events). EDAC was designed primarily for systems that do NOT have an "event managing" BIOS/BMC pair (as Dell servers do) but whose chipset can still report errors such as SBE (single-bit errors). At this time EDAC is not supported or validated by Dell, and as such it is recommended NOT to use EDAC.

More information on EDAC:

  edac.txt      http://lwn.net/Articles/168975/
  EDAC Project  http://bluesmoke.sourceforge.net/

Ben Gordy
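
The runtime steps in the reply (list the loaded edac modules, then remove them) can be sketched as a pair of small shell helpers. This is only an illustration: the function names list_edac_modules and edac_blacklist_lines are hypothetical and not part of OMSA or any kernel tool. The second helper emits standard modprobe "blacklist" lines, which is an assumption about what the blacklist under /etc/modprobe.d/ would contain; it is a different mechanism from the "alias ... /dev/null" lines quoted in the reply.

```shell
#!/bin/sh
# Sketch only: the helper names are hypothetical, chosen for illustration.

# Read lsmod-style output on stdin and print the names of loaded modules
# whose name contains "edac" (case-insensitive). The first line of lsmod
# output is a column header, so it is skipped.
list_edac_modules() {
    awk 'NR > 1 && tolower($1) ~ /edac/ { print $1 }'
}

# Turn module names on stdin (one per line) into modprobe "blacklist" lines.
# Assumption: a plain "blacklist <module>" entry is what is wanted; note that
# it only stops alias-based autoloading of the module.
edac_blacklist_lines() {
    while read -r module; do
        if [ -n "$module" ]; then
            printf 'blacklist %s\n' "$module"
        fi
    done
}
```

As root, 'lsmod | list_edac_modules' shows what is loaded, and 'modprobe -r $(lsmod | list_edac_modules)' would remove those modules for the current boot (it fails if the list is empty, since modprobe -r then gets no module name). The output of 'lsmod | list_edac_modules | edac_blacklist_lines' could be redirected to a file of your choosing under /etc/modprobe.d/.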
