On Mon, Jul 16, 2012 at 10:39:58AM -0700, Trent Nelson wrote:
>
> On Jul 16, 2012, at 12:38 PM, Daniel Shahaf wrote:
>
> > Trent Nelson wrote on Mon, Jul 16, 2012 at 08:58:09 -0700:
> >> Somewhat related: is this a FreeBSD box?
> >
> > Yes, it's eris from http://www.apache.org/dev/machines.
> >
> >> ports/sysutils/mcelog is useful for getting info on any ECC errors
> >> that might have occurred.
> >
> > Thanks for the pointer.  The port description says: "The primary purpose
> > is to provide a way to decode MCE output from the FreeBSD kernel into
> > something more human-readable" --- how to get the "raw" MCE output?  I
> > don't see "mce" mentioned in `sysctl -a` or /var/log/messages.
>
> Yeah it's definitely on the cryptic side.  I'm dubious as to whether or
> not the majority of features mentioned in the man page actually work.
>
> From experience, I simply `pkg_add -r mcelog`'d and then ran `mcelog`
> on a FreeBSD box of mine that looked like it had some wonky DIMMs.  I
> noticed a MCE line in the console log, so I ran `mcelog`, and wallah,
> heaps of info about the error (the exact DIMM/slot was handy).

Just had a box play up again in the same manner.  For the archives,
here's the type of message that'll show up on the system console:

    MCA: Address 0x7d4ffbcc0
    MCA: Bank 1, Status 0xd400400000000853
    MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
    MCA: Vendor "AuthenticAMD", ID 0x20f12, APIC ID 7
    MCA: CPU 7 COR OVER BUSLG Source IRD Memory
    MCA: Address 0x7c3cffe80
    MCA: Bank 2, Status 0xd000400000000863
    MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
    MCA: Vendor "AuthenticAMD", ID 0x20f12, APIC ID 7
    MCA: CPU 7 COR OVER BUSLG Source PREFETCH Memory

Simply running `mcelog` on that box produces this (snipped the first
123 examples):

    STATUS d471c00000000833 MCGSTATUS 0
    MCGCAP 105 APICID 7 SOCKETID 0
    CPUID Vendor AMD Family 15 Model 33
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 124
    CPU 7 1 instruction cache TSC 991bb5281cc5
    ADDR 7c3cffe80
      Instruction cache ECC error
           bit46 = corrected ecc error
           bit62 = error overflow (multiple errors)
      bus error 'local node origin, request didn't time out
          instruction fetch mem transaction
          memory access, level generic'
    STATUS d400400000000853 MCGSTATUS 0
    MCGCAP 105 APICID 7 SOCKETID 0
    CPUID Vendor AMD Family 15 Model 33
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    MCE 125
    CPU 7 2 bus unit TSC 991bb528226c
      L2 cache ECC error
      Bus or cache array error
           bit46 = corrected ecc error
           bit62 = error overflow (multiple errors)
      bus error 'local node origin, request didn't time out
          prefetch mem transaction
          memory access, level generic'

If running `mcelog` doesn't produce anything, it's probably not an ECC
issue.

        Trent.

(P.S. Anyone in the market for a cheap Sun Fire v40z?  Very reliable!)
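
For anyone who wants a quick first pass over a saved console log before
reaching for `mcelog`, here is a rough Python sketch. It pulls the status
word out of each `MCA: Bank N, Status 0x...` line and tests only the two
bits that the mcelog output above labels (bit46 and bit62). The function
name `summarize_mca` and the regular expression are illustrative
assumptions, not part of mcelog or FreeBSD, and the sketch does not
replace mcelog's full decoding.

```python
import re

# Bit positions exactly as labelled in the mcelog output quoted above.
CORRECTED_ECC_BIT = 46  # "bit46 = corrected ecc error"
OVERFLOW_BIT = 62       # "bit62 = error overflow (multiple errors)"

# Matches only the per-bank line; the "Global Cap ..., Status ..." line
# is skipped because it does not start with "MCA: Bank".
BANK_LINE = re.compile(r"MCA: Bank (\d+), Status 0x([0-9a-fA-F]+)")


def summarize_mca(console_text):
    """Return one dict per 'MCA: Bank N, Status 0x...' line found."""
    results = []
    for match in BANK_LINE.finditer(console_text):
        status = int(match.group(2), 16)
        results.append({
            "bank": int(match.group(1)),
            "status": status,
            "corrected_ecc": bool((status >> CORRECTED_ECC_BIT) & 1),
            "overflow": bool((status >> OVERFLOW_BIT) & 1),
        })
    return results


if __name__ == "__main__":
    # Sample lines copied from the console output in this message.
    sample = (
        "MCA: Bank 1, Status 0xd400400000000853\n"
        "MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000\n"
        "MCA: Bank 2, Status 0xd000400000000863\n"
    )
    for entry in summarize_mca(sample):
        print("bank {bank}: corrected_ecc={corrected_ecc} "
              "overflow={overflow}".format(**entry))
```

Both sample status words have bit46 and bit62 set, which agrees with the
"corrected ecc error" and "error overflow" annotations mcelog printed for
the same box.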