Salut Velu!

Hey Bruce, small world, isn't it ;))

Yup. (Actually, if you look back through the archives you'll find that the first announcement of smartmontools was on this list.)

[...]
I would appreciate advice about:
  -- how to configure these settings
  -- pointers to relevant AMD/Serverworks documentation
  -- relevant Linux kernel options/modules
  -- anything else relevant/related

You can find some documentation at this project: http://bluesmoke.sourceforge.net/ or the older http://www.anime.net/~goemon/linux-ecc/

I've been corresponding off-list with Mark Langsdorf. He's an AMD employee who works on Linux tools and implementation, hangs out on the LKML, and submits kernel patches from AMD. Mark said that the 'bluesmoke' functionality is only needed with 2.4 kernels. With 2.6 kernels you just install 'mcelog' and that's everything that's needed.

Mark also said that the mapping between CPUID and chipid needs to be correlated with DIMM slot on a case-by-case basis. One way (which Mark does NOT recommend!) is to heat each DIMM with a heat gun, or mask off a single bit on the connector, to generate errors from that DIMM. This makes sense for people on this list who will have dozens or hundreds of the same box and want to understand this relationship.
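
Once that correlation has been worked out for one box, it can be written down as a small lookup table and reused on every identical node. A minimal sketch, assuming errors are reported as a (CPU, chip id) pair; the slot labels and table entries below are hypothetical placeholders, not values for any real motherboard:

```python
from collections import Counter

# HYPOTHETICAL table: (cpu, chip_id) -> label of the physical DIMM slot.
# These entries are placeholders; the real ones must be determined per board.
SLOT_MAP = {
    (0, 0): "CPU0_DIMM_A",
    (0, 1): "CPU0_DIMM_B",
    (1, 0): "CPU1_DIMM_A",
}


def tally_by_slot(events, slot_map=SLOT_MAP):
    """Count errors per DIMM slot; events is an iterable of (cpu, chip_id)."""
    tally = Counter()
    for cpu, chip_id in events:
        # Keep unknown pairs visible instead of silently dropping them.
        slot = slot_map.get((cpu, chip_id), f"unmapped(cpu={cpu},chip={chip_id})")
        tally[slot] += 1
    return tally


if __name__ == "__main__":
    sample = [(0, 1), (0, 1), (1, 0), (1, 3)]  # made-up events for illustration
    for slot, n in tally_by_slot(sample).most_common():
        print(slot, n)
```

Unmapped pairs are reported by name, so a gap in the table shows up as soon as an unexpected error arrives.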

EDAC appears to be on its way to being integrated upstream (http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=806c35f5057a64d3061ee4e2b1023bf6f6d328e2). This looks like preliminary work, but you may want to give it a try. I don't know your configuration, but "drivers/edac/amd76x_edac.c" may match. I haven't had time to test EDAC, but if you do, I'm interested in your results.

I'll report back to the list whether mcelog is enough, or whether we also needed to install other drivers to get ECC reporting.
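
For anyone who does try an EDAC driver, one quick check is whether it exposes error counters in sysfs. A minimal sketch, assuming the layout documented for EDAC in later kernels (one directory per memory controller under /sys/devices/system/edac/mc, each with ce_count and ue_count files); that path is an assumption and may not match the preliminary code referenced above:

```python
from pathlib import Path

# ASSUMED location of the EDAC memory-controller counters; adjust if your
# kernel exposes them elsewhere.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or a different sysfs layout
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            counter_file = mc / fname
            if counter_file.is_file():
                entry[key] = int(counter_file.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found under", EDAC_MC_ROOT)
    for name, c in found.items():
        print(name, "corrected:", c.get("ce"), "uncorrected:", c.get("ue"))
```

An empty result only means that no counters were found at the assumed path; it does not show that ECC is disabled.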

Mark also provided advice about the other ECC settings. I'll copy it verbatim to the list. Mark wrote:

You'll want to look at chapter 3 (Memory System) of the BKDG (AMD 64 BIOS AND KERNEL DEVELOPERS GUIDE). Here's the recommended settings:

   ECC enable
                Enable

   MCA DRAM ECC logging
                Enable

   ECC Chip Kill
                Enable if using x4 DIMMs

   DRAM Scrub Redirect
                Enable

   DRAM BG Scrub
                set as high as possible (84 ms is maximum)

   L2 Cache BG Scrub
                not DRAM related

   Data Cache BG Scrub
                not DRAM related

[Note from Bruce: can anyone on the list make recommendations about these last two, non-DRAM-related scrub settings?]
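
To get a feel for what a DRAM BG Scrub value means in practice, it helps to work out how long one full pass over memory would take. A rough sketch, assuming the setting is the interval between scrubs of successive 64-byte blocks (my reading of the BKDG; please check it against the scrub-rate table there). The 1 us value is only an illustrative comparison point:

```python
BLOCK_BYTES = 64  # ASSUMPTION: one scrub event covers one 64-byte block


def full_scrub_seconds(memory_bytes, interval_seconds, block_bytes=BLOCK_BYTES):
    """Time for the background scrubber to visit every block once."""
    blocks = memory_bytes // block_bytes
    return blocks * interval_seconds


if __name__ == "__main__":
    one_gib = 1 << 30
    # 84 ms is the value quoted above; 1 us is an arbitrary comparison point.
    for label, interval in (("84 ms", 84e-3), ("1 us", 1e-6)):
        secs = full_scrub_seconds(one_gib, interval)
        print(f"{label} interval: {secs:.1f} s ({secs / 86400:.2f} days) per GiB")
```

If the 64-byte assumption holds, an 84 ms interval works out to roughly 16 days per GiB of memory, so it is worth confirming in the BKDG which end of the range "as high as possible" refers to.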

I also asked Mark:

Am I correct that there is nothing in the Linux kernel which
modifies the machine registers which determine ECC behavior,
so I have to depend upon the BIOS to initialize/configure
these registers as I want?

He replied:

As far as I know, it's BIOS set-up only.  Linux tries to avoid
knowing the details of the DRAM set-up, and there's a limit to
how much the OS can modify anyway.  Linux can set bits to
determine what MCEs cause exceptions, but it can't enable the
DRAM scrubber, for example.

Cheers,
        Bruce
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
