Jarod Wilson
Fri, 12 Oct 2007 09:06:39 -0700
Jos Vos wrote:
> Hi,
>
> On one AMD Opteron 265 system I once saw these error messages:
>
> Aug 14 03:07:31 node1 kernel: EDAC k8 MC1: general bus error: participating
> processor(local node response), time-out(no timeout) memory transaction
> type(generic read), mem or i/o(mem access), cache level(generic)
> Aug 14 03:07:31 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain
> 8, syndrome 0x2c, row 0, channel 1, label "": k8_edac
> Aug 14 03:07:31 node1 kernel: EDAC k8 MC1: extended error code: ECC error
> Aug 14 03:09:24 node1 kernel: EDAC k8 MC1: general bus error: participating
> processor(local node origin), time-out(no timeout) memory transaction
> type(generic read), mem or i/o(mem access), cache level(generic)
> Aug 14 03:09:24 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain
> 8, syndrome 0x2c, row 0, channel 1, label "": k8_edac
> Aug 14 03:09:24 node1 kernel: EDAC k8 MC1: extended error code: ECC error
> Aug 14 03:20:26 node1 kernel: EDAC k8 MC1: general bus error: participating
> processor(local node origin), time-out(no timeout) memory transaction
> type(generic read), mem or i/o(mem access), cache level(generic)
> Aug 14 03:20:26 node1 kernel: EDAC MC1: CE page 0x28b11e, offset 0x308, grain
> 8, syndrome 0x2c, row 0, channel 1, label "": k8_edac
> Aug 14 03:20:26 node1 kernel: EDAC k8 MC1: extended error code: ECC error
>
> This was still with kernel 8.1.8.el5, if that matters (now I'm running
> 8.1.14.el5). For the rest, I have not seen any error messages and the
> system seems to work ok. Should I start worrying?

Not a whole lot, no. What you're seeing are correctable memory errors
(CEs) from some of the memory tied to CPU1 (MC${X} = memory controller
on CPU${X}). ECC memory is designed to handle exactly this situation;
EDAC is just letting you know that it happened. If you get excessive
errors, you may well want to replace the memory, since it could be a
sign it's starting to go bad, but if it's not happening very often, I
wouldn't worry about it much. Now, if you get a UE (uncorrectable memory
error), that's definitely bad...
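
To judge whether CEs are "excessive", it helps to tally them per location. Here is a rough Python sketch that counts the "CE page" lines in the format quoted above. The function name and the grouping key are my own choices for illustration, and it assumes each syslog entry sits on one unwrapped line, as it would in /var/log/messages rather than in this mail.

```python
import re
from collections import Counter

# Matches lines such as:
#   EDAC MC1: CE page 0x28b11e, offset 0x308, grain 8, syndrome 0x2c, row 0, channel 1, ...
# Assumption: the entry is on a single line (not wrapped by a mail client).
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+),.*?"
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

def count_correctable_errors(log_text):
    """Count CE reports per (controller, row, channel, page)."""
    counts = Counter()
    for match in CE_LINE.finditer(log_text):
        key = (
            int(match["mc"]),
            int(match["row"]),
            int(match["channel"]),
            match["page"],
        )
        counts[key] += 1
    return counts
```

Run over the three CE entries quoted above, everything lands in a single bucket (MC1, row 0, channel 1, page 0x28b11e), which points at one spot on one DIMM rather than a general problem.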
NB: in a past life, I worked on large Opteron clusters that had EDAC
running on them, and CEs were just a fact of life. We only replaced
RAM when the number of CEs exceeded a certain threshold, even though I
don't think we ever actually saw any CEs cause a problem. We had the
machines set up to panic immediately if they hit a UE, though, since
that's a sure data corrupter.
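
That threshold policy can be sketched in a few lines of Python. This is only an illustration, not what we actually ran: the threshold value, the function names, and the returned labels are made up, and it assumes the kernel exposes per-controller `ce_count` and `ue_count` files under /sys/devices/system/edac/mc/mcN/, which may not match the layout on every kernel.

```python
from pathlib import Path

# Assumption: EDAC counters live here; adjust for your kernel.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")
# Hypothetical threshold; pick whatever suits your site.
CE_THRESHOLD = 100

def read_counts(root=EDAC_ROOT):
    """Return {controller_name: (ce_count, ue_count)} for each mcN directory."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        ce = int((mc_dir / "ce_count").read_text().strip())
        ue = int((mc_dir / "ue_count").read_text().strip())
        counts[mc_dir.name] = (ce, ue)
    return counts

def assess(ce_count, ue_count, ce_threshold=CE_THRESHOLD):
    """Classify a controller: any UE is critical, too many CEs means replace."""
    if ue_count > 0:
        return "critical"
    if ce_count > ce_threshold:
        return "replace"
    return "ok"
```

Something like `{name: assess(ce, ue) for name, (ce, ue) in read_counts().items()}` from a cron job would be enough to flag DIMMs worth swapping. Panicking on a UE is better left to the kernel's own EDAC setting than to a script like this.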
--
Jarod Wilson
[EMAIL PROTECTED]