Re: [osol-discuss] DL385 hangs after 90 minutes with B66

Chris Linton-Ford Thu, 02 Aug 2007 10:06:49 -0700

On Thu, 2007-08-02 at 17:22 +0100, Gavin Maltby wrote:
> On 08/02/07 15:31, Boyd Adamson wrote:
> > Chris Linton-Ford <[EMAIL PROTECTED]> writes:
> > 
> >> Hi all,
> >>
> >> I am trying to get SXCR B66 running on an HP DL385 G1 with 3x300G disks
> >> on a Smart Array 6i controller. I have managed to get the operating
> >> system installed several times; I use the HP-supplied CPQary3 drivers
> >> and the system appears to install fine. The only slight wrinkle is that
> >> the /etc/inet/hosts file does not get an entry for the system's hostname
> >> during installation, so I need to add that manually after the first
> >> boot. Everything else works as normal.
> >>
> >> The problem is that after about 90 minutes of uptime, the system becomes
> >> completely unresponsive; network requests are left hanging and the
> >> console does not echo typed characters. The console cursor is still
> >> blinking. To reboot requires a hard reset, and after another 90 minutes
> >> the same thing happens again.
> 
> > Wasn't there something about the memory scrubber on systems like this
> > that caused these symptoms?
> 
> Yes, but I fixed that back in build 51.  This was AMD erratum 99 in which
> the dram scrubber should not be enabled if there is a dram hole created
> using chip-select hoisting - it tries to read into the hole and that
> can cause hard hangs etc.  The erratum applied to revision D and
> earlier cpus - on rev E and later a dedicated dram hole register
> allows remapping the hole without discontiguous chip-select ranges.
> 
> HP confirmed the fix at the time, but I have had one report that
> there has been a similar problem repeated since (I think with
> Solaris 10 Update 4 which has the fix); now your report.
> I think in the previous one we eliminated the fault management
> software by disabling it, so I suspect another issue perhaps.
> But the 90 minutes is ominous - that is around the time it
> takes to scrub 2GB at the default scrub rate we set in Solaris
> (which works out to around 1GB every 45 minutes).  Some BIOSes
> offer both forms of remapping the hole even on rev E and later
> parts.
> 
> Could you do the following for me, please:
> 
> 0) In your BIOS options look for an option to remap/reclaim
>     the dram hole, and if it is offering a choice of "software"
>     vs "hardware" remapping select hardware.  If that was
>     not already selected reboot with that setting and see
>     if the hang occurs.  Set the following in /etc/system
>     to increase the scrub rate we apply so (if we're guilty)
>     you won't have to wait 90 minutes each time:
> 
>       set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=1
> 
> 1) Regardless of the results of 0, in mdb dump out the memory
>     controller nvlist info and then the full memory controller structure:
> 
> mdb -k <<EOM
> *mc_list::list mc_t mc_next | ::print mc_t mc_nvl | ::nvlist
> *mc_list::list mc_t mc_next | ::print mc_t
> EOM
> 
> 2) If 0 did not take care of it, rename the two AMD cpu modules and reboot:
> 
> # mv /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15 \
>       /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15-
> 
> # mv /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15 \
>       /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15-
> 
> # init 6
> 
> That's a bit heavy-handed since it eliminates all the config
> operations we perform and not just the scrubber stuff.  We will
> fall back to the dumb generic support.
> 
> 3) If 2) appears to let you survive longer than 90 minutes you can
>     add the following to /etc/system as a workaround:
> 
>       set cpu\.AuthenticAMD\.15:ao_scrub_policy=1
>       set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=0
> 
>     which will stop us enabling the dram scrubber.  If you
>     set the dram scrub rate in 0) above to 1 be sure to replace
>     that line with the 0 setting above.
> 
> In all cases do not leave the scrub rate setting of 1 (maximum rate)
> that you may have set in step 0 - that scrubs at around 1G every
> 0.66s and you *will* notice the performance impact!
> 
> Thanks
> 
> Gavin
> 
> 
Thanks Gavin,


Following the first reply I tried the workaround mentioned in
http://mail.opensolaris.org/pipermail/opensolaris-bugs/2006-September/000676.html,
which seems to have worked (2 hours 20 minutes of uptime so far), but I'll follow
your instructions in full and report back with the results first thing tomorrow.

Cheers,

Chris
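
The timing figures quoted above can be cross-checked with a little arithmetic. This is only a sketch, assuming the approximate rates given in the thread (around 1GB every 45 minutes by default, around 1GB every 0.66s at the maximum setting) hold exactly. The function names are made up for illustration and are not part of any Solaris tool.

```shell
#!/bin/sh
# Hypothetical helpers; the rates are the approximate figures quoted
# in the thread, not values read from the hardware.

# Minutes to scrub N GB at the default rate (~1GB per 45 minutes).
scrub_minutes_default() {
    echo $(( $1 * 45 ))
}

# Seconds to scrub N GB at the maximum rate (~1GB per 0.66s).
# awk is used because POSIX shell arithmetic is integer-only.
scrub_seconds_max() {
    awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 0.66 }'
}

scrub_minutes_default 2   # 2GB at the default rate
scrub_seconds_max 2       # 2GB at the maximum rate
```

For 2GB the first call gives 90 minutes, which is why the 90-minute hang interval points at the scrubber.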

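Since the instructions warn against leaving the maximum scrub rate in place, a quick check of /etc/system may help once testing is done. This is a rough sketch, assuming the tunable is written exactly as in the quoted instructions (no tabs or spaces around the "="). check_scrub_rate is a hypothetical helper name, and the file is passed as an argument so the check can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: warn if the maximum dram scrub rate (value 1)
# is still set in the given file. Lines beginning with '*' are treated
# as comments, as in /etc/system.
check_scrub_rate() {
    file="$1"
    # Drop comment lines, then look for the tunable set to exactly 1.
    if grep -v '^ *\*' "$file" | grep 'ao_scrub_rate_dram=1 *$' > /dev/null; then
        echo "WARNING: ao_scrub_rate_dram=1 (maximum rate) is still set in $file"
        return 1
    fi
    echo "OK: maximum dram scrub rate is not set in $file"
    return 0
}
```

Running check_scrub_rate /etc/system after step 3 should report OK if the line from step 0 was replaced with the 0 setting.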