On 08/02/07 15:31, Boyd Adamson wrote:
Chris Linton-Ford <[EMAIL PROTECTED]> writes:

Hi all,

I am trying to get SXCR B66 running on an HP DL385 G1 with 3x300G disks
on a Smart Array 6i controller. I have managed to get the operating
system installed several times; I use the HP-supplied CPQary3 drivers
and the system appears to install fine. The only slight wrinkle is that
the /etc/inet/hosts file does not get an entry for the system's hostname
during installation, so I need to add that manually after the first
boot. Everything else works as normal.

The problem is that after about 90 minutes of uptime, the system becomes
completely unresponsive; network requests are left hanging and the
console does not echo typed characters. The console cursor is still
blinking. Rebooting requires a hard reset, and after another 90 minutes
the same thing happens again.

Wasn't there something about the memory scrubber on systems like this
that caused these symptoms?

Yes, but I fixed that back in build 51.  This was AMD erratum 99 in which
the dram scrubber should not be enabled if there is a dram hole created
using chip-select hoisting - it tries to read into the hole and that
can cause hard hangs etc.  The erratum applied to revision D and
earlier cpus - on rev E and later a dedicated dram hole register
allows remapping the hole without discontiguous chip-select ranges.

HP confirmed the fix at the time, but I have had one report of
a similar problem since then (I think with Solaris 10 Update 4,
which has the fix); now your report.
I think in the previous one we eliminated the fault management
software by disabling it, so I suspect another issue perhaps.
But the 90 minutes is ominous - that is around the time it
takes to scrub 2GB at the default scrub rate we set in Solaris
(which works out to around 1GB every 45 minutes).  Some BIOSes
offer both forms of remapping the hole even on rev E and later
parts.
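
As a rough sketch of that arithmetic: the snippet below assumes the
approximate 45-minutes-per-GB figure is exact and that a scrub pass
scales linearly with memory size.  The names MIN_PER_GB and
scrub_minutes are made up for illustration.

```shell
#!/bin/sh
# Estimate how long one full scrub pass takes at the default rate.
# MIN_PER_GB=45 is the approximate figure quoted above (an assumption,
# not a value read from the hardware).
MIN_PER_GB=45

scrub_minutes() {
    # $1 = installed memory in whole GB
    echo $(( $1 * MIN_PER_GB ))
}

scrub_minutes 2   # a 2GB system: prints 90
```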

Could you do the following for me, please:

0) In your BIOS options look for an option to remap/reclaim
   the dram hole, and if it offers a choice of "software"
   vs "hardware" remapping, select hardware.  If that was
   not already selected, reboot with that setting and see
   whether the hang still occurs.  Also set the following in
   /etc/system to increase the scrub rate we apply so (if we're
   guilty) you won't have to wait 90 minutes each time:

        set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=1

1) Regardless of the results of 0, in mdb dump out the memory
   controller nvlist info and then the full memory controller structure:

mdb -k <<EOM
*mc_list::list mc_t mc_next | ::print mc_t mc_nvl | ::nvlist
*mc_list::list mc_t mc_next | ::print mc_t
EOM

2) If 0 did not take care of it, rename the two AMD cpu modules and reboot:

# mv /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15 \
        /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15-

# mv /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15 \
        /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15-

# init 6

That's a bit heavy-handed since it eliminates all the config
operations we perform and not just the scrubber stuff.  We will
fall back to the dumb generic support.
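
To undo the renames once the test is done, something like the sketch
below should do; it simply reverses the two mv commands above.  The
function name and the optional root-prefix argument are hypothetical,
added so it can be tried on a scratch directory first.  A reboot is
still needed afterwards for the modules to load.

```shell
#!/bin/sh
# Sketch: move the renamed AMD cpu modules back to their original names.
# $1 = optional root prefix (hypothetical; empty means the live system).
restore_amd_modules() {
    root=${1:-}
    for dir in platform/i86pc/kernel/cpu platform/i86pc/kernel/cpu/amd64; do
        mod="$root/$dir/cpu.AuthenticAMD.15"
        # Only move the file back if the renamed copy exists and the
        # original name is not already taken.
        if [ -f "$mod-" ] && [ ! -e "$mod" ]; then
            mv "$mod-" "$mod" && echo "restored $mod"
        fi
    done
}

# Usage on the live system (as root), followed by a reboot:
#   restore_amd_modules && init 6
```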

3) If 2) appears to let you survive longer than 90 minutes you can
   add the following to /etc/system as a workaround:

        set cpu\.AuthenticAMD\.15:ao_scrub_policy=1
        set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=0

   which will stop us enabling the dram scrubber.  If you
   set the dram scrub rate to 1 in 0) above, be sure to replace
   that line with the 0 setting above.

In all cases do not leave the scrub rate setting of 1 (maximum rate)
that you may have set in step 0 - that scrubs at around 1G every
0.66s and you *will* notice the performance impact!
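
A quick way to check for a leftover maximum-rate line might look like
the sketch below.  check_scrub_rate is a made-up helper, not a Solaris
tool; it assumes the setting is written one per line as shown above
and that comment lines in /etc/system start with '*'.

```shell
#!/bin/sh
# Hypothetical helper: warn if ao_scrub_rate_dram is still set to 1.
# $1 = path to the system file (defaults to /etc/system)
check_scrub_rate() {
    file=${1:-/etc/system}
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # Drop comment lines, then look for the maximum-rate setting.
    if grep -v '^[[:space:]]*\*' "$file" |
       grep -q 'ao_scrub_rate_dram[[:space:]]*=[[:space:]]*1[[:space:]]*$'; then
        echo "WARNING: ao_scrub_rate_dram=1 still set in $file"
        return 1
    fi
    echo "OK: no maximum-rate scrub setting in $file"
}

# Usage:
#   check_scrub_rate             # checks /etc/system
#   check_scrub_rate /tmp/copy   # checks a copy
```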

Thanks

Gavin

