These errors indicate that a memory module in the system is throwing
correctable ECC errors. If only a few of these messages appear in, say,
a week, then you're probably OK (a cosmic ray flipping a bit, or
something). But if they're appearing regularly, then you have a bad
DIMM that you should replace before you get an uncorrectable error and
the kernel panics.
Also, the row and channel numbers in these messages tend to be difficult
to trust when locating the physical slot, though the memory controller
number (MC1 = CPU1, the second CPU) is generally reliable.
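
To judge whether the messages are "a few in a week" or "appearing regularly", it can help to count them per controller and address. The following is only a rough sketch: it assumes each syslog entry sits on a single line (the log quoted below was wrapped by the mail client), and the function name count_correctable_errors is made up for illustration.

```python
import re
from collections import Counter

# Matches the "CE" (correctable error) line format seen in the quoted log.
# Assumption: the whole syslog entry is on one line.
CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+),.*?"
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def count_correctable_errors(lines):
    """Count CE lines, keyed by (controller number, page, offset)."""
    counts = Counter()
    for line in lines:
        match = CE_RE.search(line)
        if match:
            key = (int(match["mc"]), match["page"], match["offset"])
            counts[key] += 1
    return counts
```

Feeding it the lines of a log file (for example an open file object) returns a Counter; repeated hits on the same controller, page and offset are more suspicious than isolated one-off addresses.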
You can also look under /sys/devices/system/edac to see the correctable
and uncorrectable ECC error counts for the various DIMMs.
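
As a sketch of how those counters could be collected in one go: the snippet below assumes the usual EDAC sysfs layout, with mcN directories under /sys/devices/system/edac/mc, each holding ce_count and ue_count files plus csrowN subdirectories with their own counts. The exact layout varies between kernel versions, so treat the paths as assumptions; read_edac_counts is a hypothetical helper name.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return per-controller and per-csrow CE/UE counts found under root."""

    def read_int(path):
        try:
            return int(path.read_text().strip())
        except (OSError, ValueError):
            return None  # file missing or unreadable on this kernel

    result = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        result[mc.name] = {
            "ce": read_int(mc / "ce_count"),  # correctable errors
            "ue": read_int(mc / "ue_count"),  # uncorrectable errors
            "csrows": {
                row.name: {
                    "ce": read_int(row / "ce_count"),
                    "ue": read_int(row / "ue_count"),
                }
                for row in sorted(mc.glob("csrow[0-9]*"))
            },
        }
    return result
```

Calling read_edac_counts() with no arguments reads the live system; a value of None means the corresponding file was not there. A nonzero "ue" count is the one to worry about most.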
Paul Krizak 5900 E. Ben White Blvd. MS 625
Advanced Micro Devices Austin, TX 78741
Linux/Unix Systems Engineering Phone: (512) 602-8775
Silicon Design Division Cell: (512) 791-0686
Jos Vos wrote:
Hi,
On one AMD Opteron 265 system I once saw these error messages:
Aug 14 03:07:31 node1 kernel: EDAC k8 MC1: general bus error: participating
processor(local node response), time-out(no timeout) memory transaction
type(generic read), mem or i/o(mem access), cache level(generic)
Aug 14 03:07:31 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain 8, syndrome
0x2c, row 0, channel 1, label "": k8_edac
Aug 14 03:07:31 node1 kernel: EDAC k8 MC1: extended error code: ECC error
Aug 14 03:09:24 node1 kernel: EDAC k8 MC1: general bus error: participating
processor(local node origin), time-out(no timeout) memory transaction
type(generic read), mem or i/o(mem access), cache level(generic)
Aug 14 03:09:24 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain 8, syndrome
0x2c, row 0, channel 1, label "": k8_edac
Aug 14 03:09:24 node1 kernel: EDAC k8 MC1: extended error code: ECC error
Aug 14 03:20:26 node1 kernel: EDAC k8 MC1: general bus error: participating
processor(local node origin), time-out(no timeout) memory transaction
type(generic read), mem or i/o(mem access), cache level(generic)
Aug 14 03:20:26 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain 8, syndrome
0x2c, row 0, channel 1, label "": k8_edac
Aug 14 03:20:26 node1 kernel: EDAC k8 MC1: extended error code: ECC error
This was still with kernel 8.1.8.el5, if that matters (now I'm running
8.1.14.el5). For the rest, I have not seen any error messages and the
system seems to work ok. Should I start worrying?
Regards,
--
-- Jos Vos <[EMAIL PROTECTED]>
-- X/OS Experts in Open Systems BV | Phone: +31 20 6938364
-- Amsterdam, The Netherlands | Fax: +31 20 6948204
_______________________________________________
rhelv5-list mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/rhelv5-list