These errors indicate that a stick of memory in the system is throwing correctable ECC errors. If only a few of these messages appear over, say, a week, then you're probably OK (solar radiation flipping a bit, or something similar). But if they appear regularly, then you have a bad DIMM, and you should replace it before it produces an uncorrectable error and the kernel panics.
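
As a rough way to judge "a few" versus "regularly", the correctable-error lines can be tallied from the syslog. This is only a sketch: the function name count_ce_events is made up for illustration, and the pattern assumes the message layout shown in the quoted log ("EDAC MC<n>: CE page 0x..."), which may differ on other kernels or drivers.

```python
import re
from collections import Counter

# Matches the correctable-error ("CE") lines logged by the EDAC core, e.g.
#   "Aug 14 03:07:31 node1 kernel: EDAC MC1: CE page 0x28b11e, offset ..."
# The "EDAC k8 MC1: ..." detail lines are intentionally not matched.
CE_LINE = re.compile(r"^(\w{3}\s+\d+) .*EDAC MC(\d+): CE page (0x[0-9a-fA-F]+)")

def count_ce_events(lines):
    """Count CE messages, keyed by (date, memory controller number, page)."""
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            date, mc, page = match.groups()
            counts[(date, int(mc), page)] += 1
    return counts
```

Feeding it the lines of /var/log/messages (or wherever your syslog writes kernel messages) gives one count per day, controller, and page. The same page showing up day after day points at a failing DIMM rather than a one-off bit flip.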

Also, the row and channel numbers in these messages tend to be hard to trust, though the CPU number (MC1 = CPU1, the second CPU) is generally reliable.

You can also look at the information under /sys/devices/system/edac to see whether correctable or uncorrectable ECC errors have been counted against the various DIMMs.
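
A minimal sketch of reading those counters follows. It assumes the layout described in the kernel's EDAC documentation, with one mc<N> directory per memory controller under /sys/devices/system/edac/mc, each holding ce_count and ue_count files. The exact files present depend on the kernel and EDAC driver, so treat the paths as assumptions; read_edac_counts is a hypothetical helper name.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mcN": {"ce": int or None, "ue": int or None}} per controller.

    A value of None means the counter file was not present. An empty dict
    means no mc* directories were found (for example, EDAC is not loaded).
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            counter = mc_dir / filename
            entry[key] = int(counter.read_text().strip()) if counter.is_file() else None
        counts[mc_dir.name] = entry
    return counts
```

Calling read_edac_counts() on the affected node and comparing the "ce" values a few days apart shows whether the correctable count is still climbing. Any nonzero "ue" value is a reason to replace the DIMM right away.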

Paul Krizak                         5900 E. Ben White Blvd. MS 625
Advanced Micro Devices              Austin, TX  78741
Linux/Unix Systems Engineering      Phone: (512) 602-8775
Silicon Design Division             Cell:  (512) 791-0686


Jos Vos wrote:
Hi,

On one AMD Opteron 265 system I once saw these error messages:

Aug 14 03:07:31 node1 kernel: EDAC k8 MC1: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
Aug 14 03:07:31 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain 8, syndrome 0x2c, row 0, channel 1, label "": k8_edac
Aug 14 03:07:31 node1 kernel: EDAC k8 MC1: extended error code: ECC error
Aug 14 03:09:24 node1 kernel: EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
Aug 14 03:09:24 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain 8, syndrome 0x2c, row 0, channel 1, label "": k8_edac
Aug 14 03:09:24 node1 kernel: EDAC k8 MC1: extended error code: ECC error
Aug 14 03:20:26 node1 kernel: EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
Aug 14 03:20:26 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain 8, syndrome 0x2c, row 0, channel 1, label "": k8_edac
Aug 14 03:20:26 node1 kernel: EDAC k8 MC1: extended error code: ECC error

This was still with kernel 8.1.8.el5, if that matters (now I'm running
8.1.14.el5).  For the rest, I have not seen any error messages and the
system seems to work ok.  Should I start worrying?

Regards,

--
--    Jos Vos <[EMAIL PROTECTED]>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204

_______________________________________________
rhelv5-list mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/rhelv5-list


