On Mon, Jul 16, 2012 at 10:39:58AM -0700, Trent Nelson wrote:
>
> On Jul 16, 2012, at 12:38 PM, Daniel Shahaf wrote:
>
> > Trent Nelson wrote on Mon, Jul 16, 2012 at 08:58:09 -0700:
> >> Somewhat related: is this a FreeBSD box?
> >
> > Yes, it's eris from http://www.apache.org/dev/machines.
> >
> >> ports/sysutils/mcelog is useful for getting info on any ECC errors
> >> that might have occurred.
> >
> > Thanks for the pointer.  The port description says: "The primary purpose
> > is to provide a way to decode MCE output from the FreeBSD kernel into
> > something more human-readable" --- how to get the "raw" MCE output?  I
> > don't see "mce" mentioned in `sysctl -a` or /var/log/messages.
>
> Yeah it's definitely on the cryptic side.  I'm dubious as to whether or
> not the majority of features mentioned in the man page actually work.
>
> From experience, I simply `pkg_add -r mcelog`'d and then ran `mcelog`
> on a FreeBSD box of mine that looked like it had some wonky DIMMs.  I
> noticed an MCE line in the console log, so I ran `mcelog`, and voilà,
> heaps of info about the error (the exact DIMM/slot was handy).

    Just had a box play up again in the same manner.  For the archives,
    here's the type of message that'll show up on the system console:

MCA: Address 0x7d4ffbcc0
MCA: Bank 1, Status 0xd400400000000853
MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x20f12, APIC ID 7
MCA: CPU 7 COR OVER BUSLG Source IRD Memory
MCA: Address 0x7c3cffe80
MCA: Bank 2, Status 0xd000400000000863
MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x20f12, APIC ID 7
MCA: CPU 7 COR OVER BUSLG Source PREFETCH Memory
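
    If you'd rather pull these out of a saved console log than eyeball
    them, something like the following rough sketch might do.  It
    assumes the lines look exactly like the ones above and that each
    record starts at its "Bank" line; the function name and regexes are
    mine, not part of any FreeBSD tool.  The "Address" lines are left
    out because it isn't obvious from this excerpt which record each
    one belongs to.

```python
import re

# Console lines copied from the message above.
SAMPLE = """\
MCA: Address 0x7d4ffbcc0
MCA: Bank 1, Status 0xd400400000000853
MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x20f12, APIC ID 7
MCA: CPU 7 COR OVER BUSLG Source IRD Memory
MCA: Address 0x7c3cffe80
MCA: Bank 2, Status 0xd000400000000863
MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x20f12, APIC ID 7
MCA: CPU 7 COR OVER BUSLG Source PREFETCH Memory
"""

# Hypothetical patterns, written against the sample text only.
BANK_RE = re.compile(r"^MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)$")
CPU_RE = re.compile(r"^MCA: CPU (\d+) (.*)$")


def parse_mca(text):
    """Return one dict per 'MCA: Bank' line, with the CPU summary attached."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = BANK_RE.match(line)
        if m:
            # Assumption: a new record begins at each "Bank" line.
            current = {"bank": int(m.group(1)), "status": int(m.group(2), 16)}
            records.append(current)
            continue
        m = CPU_RE.match(line)
        if m and current is not None:
            current["cpu"] = int(m.group(1))
            current["summary"] = m.group(2)
    return records


if __name__ == "__main__":
    for rec in parse_mca(SAMPLE):
        print(rec["bank"], hex(rec["status"]), rec.get("summary"))
```

    Feeding it the contents of a log file instead of SAMPLE should work
    the same way, provided the log doesn't prefix the lines with
    timestamps or hostnames.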

    Simply running `mcelog` on that box produces this (I've snipped the
    first 123 entries):

STATUS d471c00000000833 MCGSTATUS 0
MCGCAP 105 APICID 7 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 33
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 124
CPU 7 1 instruction cache TSC 991bb5281cc5 
ADDR 7c3cffe80 
  Instruction cache ECC error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             instruction fetch mem transaction
             memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0
MCGCAP 105 APICID 7 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 33
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 125
CPU 7 2 bus unit TSC 991bb528226c 
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             prefetch mem transaction
             memory access, level generic'
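
    The "bit46" and "bit62" notes above refer to bits in the STATUS
    value.  As a minimal sketch, the flag bits can be pulled out of the
    raw number like this.  Bits 46 and 62 are the ones mcelog names in
    the output above; the remaining positions are my assumption, based
    on the usual machine-check status register layout, and the short
    flag names are just labels I've picked.

```python
# (bit position, label) pairs.  Only bits 46 and 62 are confirmed by the
# mcelog output above; the others are assumed from the usual layout of
# the machine-check status register.
STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # error overflow (multiple errors)
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # misc register valid
    (58, "ADDRV"),  # address register valid
    (57, "PCC"),    # processor context corrupt
    (46, "CECC"),   # corrected ECC error
    (45, "UECC"),   # uncorrected ECC error
]


def decode_status(status):
    """Return the labels of the flag bits that are set in a status value."""
    return [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]


if __name__ == "__main__":
    # The two status values from the console messages earlier in this mail.
    for status in (0xd400400000000853, 0xd000400000000863):
        print(hex(status), " ".join(decode_status(status)))
```

    Both console status values come out with the overflow and
    corrected-ECC flags set and without the uncorrected flag, which
    agrees with what mcelog reports.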

    If running `mcelog` doesn't produce anything, it's probably not an
    ECC issue.


        Trent.

(P.S. Anyone in the market for a cheap Sun Fire v40z?  Very reliable!)
