Re: [osol-discuss] DL385 hangs after 90 minutes with B66

Gavin Maltby Mon, 06 Aug 2007 06:01:32 -0700

Hi,

On 08/06/07 13:04, Chris Linton-Ford wrote:

Apologies for the delay in replying - our fileserver died a grisly death
on Friday afternoon.

Could you do the following for me, please:

0) In your BIOS options look for an option to remap/reclaim
    the dram hole, and if it is offering a choice of "software"
    vs "hardware" remapping select hardware.  If that was
    not already selected reboot with that setting and see
    if the hang occurs.  Set the following in /etc/system
    to increase the scrub rate we apply so (if we're guilty)
    you won't have to wait 90 minutes each time:

        set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=1

No option exists in the BIOS Setup Utility to remap the dram hole, as
far as I could see.


OK.  It appears you do not have the hole remapped either way, anyway.

Adding the above line to /etc/system did indeed reduce the time to crash
- so much so that it didn't get as far as the login prompt :( Hence, as
I didn't have the nous or time to roll the Smart Array drivers into a
Live CD or the failsafe boot environment, I reinstalled.


Sorry, I should have listed a few ways out of that hole.  My bad.

1) Regardless of the results of 0, in mdb dump out the memory
    controller nvlist info and then the full memory controller structure:

mdb -k <<EOM
*mc_list::list mc_t mc_next | ::print mc_t mc_nvl | ::nvlist
*mc_list::list mc_t mc_next | ::print mc_t
EOM


The output is pretty long: instead of attaching it I've put it here:

http://chrislf.freeshell.org/mdb.out


Thanks for this.

You have revision E Opteron cpus in two Socket 940 sockets.
On each node there are 4 dimms, each dimm being single-rank
and each being 512M in size.  On each chip the dimms are
arranged into two 1G chip-selects of 128-bit width.
There is no node interleaving, and the chip-selects
of each node are configured in a two-way chip-select
interleave.

On both nodes the dram hole size reflected in the dram hole
register is zero, so there is no hardware hole reclaim.
The chip-selects are interleaved so there is no possibility
of discontiguous chip-selects (erratum #99 does not apply).
Indeed with 2G per node there is nothing to reclaim - it's
only when the installed memory on a node approaches 4G that
you overlap with the MMIO area and lose access to that dram
that overlaps MMIO.
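
The dimm arithmetic in that description can be checked with a small
shell sketch.  The function names are made up for illustration, and the
pairing of two 64-bit dimms per 128-bit chip-select is an assumption
about how the width is built, not something read from the mdb output.
It needs a POSIX shell (ksh or bash) for the $(( )) arithmetic.

```shell
#!/bin/sh
# Sketch of the per-node dimm arithmetic described above.
# Assumption: a 128-bit chip-select pairs two 64-bit single-rank dimms.
# All function names are hypothetical.

# node_mem_mb DIMMS DIMM_MB -> total dram on one node, in MB
node_mem_mb() {
    echo $(( $1 * $2 ))
}

# cs_size_mb DIMM_MB -> size of one 128-bit chip-select, in MB
cs_size_mb() {
    echo $(( $1 * 2 ))
}

# cs_count DIMMS -> number of 128-bit chip-selects on one node
cs_count() {
    echo $(( $1 / 2 ))
}

echo "per-node dram:     $(node_mem_mb 4 512) MB"
echo "chip-select size:  $(cs_size_mb 512) MB"
echo "chip-selects/node: $(cs_count 4)"
```

With 4 dimms of 512M this gives 2G per node in two 1G chip-selects,
which matches the figures above.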

What stands out here, as it did in the Sun-internal case I had
a look at, is that "bank swizzling" is enabled on this system.
I believe there is no BIOS option to disable it.  Bank swizzling
is an AMD mode in which some row bits are interchanged with
SDRAM internal bank-select bits to change the physical order
in which sdram bits are traversed as physical addresses increase
(this can be a performance win for some access patterns).
That stands out only because I have never seen that enabled
in any lab system, and the only two systems I've seen it
enabled on are these two (yours + internally reported)
systems reporting hangs involving the dram scrubber.
This leads me to suspect that there may be an issue
involving the dram scrubber in the presence of bank
swizzling.

The last I heard from the internal case the HP BIOS team
were going to experiment.  I've just asked for an update.

Could you try one more experiment, please?  Leave the Solaris
/etc/system setting below to stop us enabling the scrubber,
but look for the BIOS option (if any) to enable dram scrubbing.
Note what it is set to before changing it, and then try
enabling the scrubber from there and see if the OS will run.
The BIOS option to enable dram scrubbing sometimes
hides within an option named something like
"ECC protection", which you can set from "none" to "good",
"better" etc - higher levels enable the scrubbers at higher
rates.

If Solaris runs ok with the dram scrubber enabled via BIOS
(I think the hang will likely still occur) then this may be
a case of starting the scrubber incorrectly from Solaris.
Actually all we do is set it going at the node dram base
address, so there is little to get wrong!

2) If 0 did not take care of it, rename the two AMD cpu modules and reboot:

# mv /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15 \
        /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15-

# mv /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15 \
        /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15-

# init 6


This worked.

OK.

That's a bit heavy-handed since it eliminates all the config
operations we perform and not just the scrubber stuff.  We will
fall back to the dumb generic support.

3) If 2) appears to let you survive longer than 90 minutes you can
    add the following to /etc/system as a workaround:

        set cpu\.AuthenticAMD\.15:ao_scrub_policy=1
        set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=0

    which will stop us enabling the dram scrubber.  If you
    set the dram scrub rate in 0) above to 1 be sure to replace
    that line with the 0 setting above.


This also worked fine! Let me know if you would like further details of
my hardware setup.


So for now I'd suggest running with the cpu.AuthenticAMD.15 modules in
place, and with the above /etc/system setting.
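
As a rough sketch of how the /etc/system change from 3) could be
prepared without hand-editing: the function below (a hypothetical name,
not an existing tool) prints a rewritten copy of the file on stdout.
It turns a leftover ao_scrub_rate_dram=1 line from 0) into the 0
setting and appends either of the two lines from 3) if it is missing.
It does not touch the file itself.

```shell
#!/bin/sh
# Sketch only: prints a rewritten copy of an /etc/system-style file.
# scrub_off is a hypothetical helper name, not part of Solaris.

# scrub_off FILE -> rewritten contents on stdout
scrub_off() {
    # Replace a leftover "=1" scrub rate from step 0) with "=0".
    sed 's/\(ao_scrub_rate_dram\)=1 *$/\1=0/' "$1"

    # Append the policy line from step 3) if the file lacks it.
    grep 'ao_scrub_policy=1' "$1" >/dev/null 2>&1 ||
        printf '%s\n' 'set cpu\.AuthenticAMD\.15:ao_scrub_policy=1'

    # Append the rate line from step 3) if no rate line exists at all.
    grep 'ao_scrub_rate_dram=' "$1" >/dev/null 2>&1 ||
        printf '%s\n' 'set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=0'
}
```

One way to use it would be to run "scrub_off /etc/system > /tmp/system.new",
compare the result against the original, keep a backup copy of
/etc/system, and only then copy the new file into place and reboot.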

Thanks

Gavin
