On 08/02/07 15:31, Boyd Adamson wrote:
Chris Linton-Ford <[EMAIL PROTECTED]> writes:

Hi all, I am trying to get SXCR B66 running on an HP DL385 G1 with 3x300G disks on a Smart Array 6i controller. I have managed to get the operating system installed several times; I use the HP-supplied CPQary3 drivers and the system appears to install fine. The only slight wrinkle is that the /etc/inet/hosts file does not get an entry for the system's hostname during installation, so I need to add that manually after the first boot. Everything else works as normal.

The problem is that after about 90 minutes of uptime, the system becomes completely unresponsive; network requests are left hanging and the console does not echo typed characters. The console cursor is still blinking. Rebooting requires a hard reset, and after another 90 minutes the same thing happens again.
Wasn't there something about the memory scrubber on systems like this that caused these symptoms?
Yes, but I fixed that back in build 51. This was AMD erratum 99 in which
the dram scrubber should not be enabled if there is a dram hole created
using chip-select hoisting - it tries to read into the hole and that
can cause hard hangs etc. The erratum applied to revision D and
earlier cpus - on rev E and later a dedicated dram hole register
allows remapping the hole without discontiguous chip-select ranges.
HP confirmed the fix at the time, but I have since had one report
of a similar problem (I think with Solaris 10 Update 4, which has
the fix), and now your report.
In that earlier case I think we ruled out the fault management
software by disabling it, so I suspect this may be another issue.
But the 90 minutes is ominous - that is around the time it
takes to scrub 2GB at the default scrub rate we set in Solaris
(which works out to around 1GB every 45 minutes). Some BIOSes
offer both forms of remapping the hole even on rev E and later
parts.
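As a rough check on that arithmetic, here is a minimal POSIX shell
sketch. It assumes the default rate really is about 45 minutes per GB,
as quoted above; the function name scrub_minutes is made up for
illustration and is not part of Solaris.

```shell
# Hypothetical helper: estimate the minutes needed for one full scrub pass,
# assuming roughly 45 minutes per GB (the approximate default quoted above).
scrub_minutes() {
    gb=$1
    echo $(( gb * 45 ))
}

scrub_minutes 2    # prints 90
```

With 2GB installed, one full pass lands on the 90 minutes reported.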
Could you do the following for me, please:
0) In your BIOS options look for an option to remap/reclaim
the dram hole, and if it is offering a choice of "software"
vs "hardware" remapping, select hardware. If that was
not already selected, reboot with that setting and see
if the hang occurs. Set the following in /etc/system
to increase the scrub rate we apply so (if we're guilty)
you won't have to wait 90 minutes each time:
set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=1
1) Regardless of the results of 0, in mdb dump out the memory
controller nvlist info and then the full memory controller structure:
mdb -k <<EOM
*mc_list::list mc_t mc_next | ::print mc_t mc_nvl | ::nvlist
*mc_list::list mc_t mc_next | ::print mc_t
EOM
2) If 0 did not take care of it, rename the two AMD cpu modules and reboot:
# mv /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15 \
/platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15-
# mv /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15 \
/platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15-
# init 6
That's a bit heavy-handed since it eliminates all the config
operations we perform and not just the scrubber stuff. We will
fall back to the dumb generic support.
3) If 2) appears to let you survive longer than 90 minutes you can
add the following to /etc/system as a workaround:
set cpu\.AuthenticAMD\.15:ao_scrub_policy=1
set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=0
which will stop us enabling the dram scrubber. If you
set the dram scrub rate to 1 in 0) above, be sure to replace
that line with the 0 setting shown here.
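If you would rather not hand-edit the file, the swap from the step 0
test setting to the step 3 workaround could be sketched roughly as
below. The function name apply_scrub_workaround and the .bak suffix are
made up for illustration. It works on whatever file you pass in, so try
it on a copy of /etc/system first.

```shell
# Hypothetical helper: replace any earlier scrub settings in the given
# file with the two workaround lines from step 3.
apply_scrub_workaround() {
    file=$1
    # Keep a backup next to the file before touching it.
    cp "$file" "$file.bak" || return 1
    # Copy everything except existing scrub policy/rate lines.
    # (grep -v exits non-zero when nothing is left over; that is fine here.)
    grep -v -e ':ao_scrub_policy=' -e ':ao_scrub_rate_dram=' "$file.bak" > "$file"
    # Append the workaround settings exactly as given in step 3.
    printf '%s\n' \
        'set cpu\.AuthenticAMD\.15:ao_scrub_policy=1' \
        'set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=0' >> "$file"
}

# Example (on a scratch copy, not the live file):
#   cp /etc/system /tmp/system.test
#   apply_scrub_workaround /tmp/system.test
```

Compare the result against the backup before copying anything back
into place; a reboot is still needed for /etc/system changes to apply.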
In all cases do not leave the scrub rate setting of 1 (maximum rate)
that you may have set in step 0 - that scrubs at around 1G every
0.66s and you *will* notice the performance impact!
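To put the maximum rate in perspective, here is a quick sketch using
only the two approximate figures from this message (about 45 minutes
per GB by default, about 0.66 seconds per GB at rate 1). The function
name scrub_speedup is made up for illustration.

```shell
# Ratio of the approximate default scrub time per GB to the approximate
# maximum-rate scrub time per GB. Both inputs are the rough figures above.
scrub_speedup() {
    awk 'BEGIN { printf "%.0f\n", (45 * 60) / 0.66 }'
}

scrub_speedup    # prints 4091
```

So the rate 1 setting scrubs on the order of four thousand times more
aggressively than the default, which is why it is only suitable for a
short test.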
Thanks
Gavin
_______________________________________________ opensolaris-discuss mailing list [email protected]
