On Thu, 2007-08-02 at 17:22 +0100, Gavin Maltby wrote:
> On 08/02/07 15:31, Boyd Adamson wrote:
> > Chris Linton-Ford <[EMAIL PROTECTED]> writes:
> >
> >> Hi all,
> >>
> >> I am trying to get SXCR B66 running on an HP DL385 G1 with 3x300G disks
> >> on a Smart Array 6i controller. I have managed to get the operating
> >> system installed several times; I use the HP-supplied CPQary3 drivers
> >> and the system appears to install fine. The only slight wrinkle is that
> >> the /etc/inet/hosts file does not get an entry for the system's hostname
> >> during installation, so I need to add that manually after the first
> >> boot. Everything else works as normal.
> >>
> >> The problem is after about 90 minutes of uptime, the system becomes
> >> completely unresponsive; network requests are left hanging and the
> >> console does not echo typed characters. The console cursor is still
> >> blinking. To reboot requires a hard reset, and after another 90 minutes
> >> the same thing happens again.
> >
> > Wasn't there something about the memory scrubber on systems like this
> > that caused these symptoms?
>
> Yes, but I fixed that back in build 51. This was AMD erratum 99 in which
> the dram scrubber should not be enabled if there is a dram hole created
> using chip-select hoisting - it tries to read into the hole and that
> can cause hard hangs etc. The erratum applied to revision D and
> earlier cpus - on rev E and later a dedicated dram hole register
> allows remapping the hole without discontiguous chip-select ranges.
>
> HP confirmed the fix at the time, but I have had one report that
> there has been a similar problem repeated since (I think with
> Solaris 10 Update 4 which has the fix); now your report.
> I think in the previous one we eliminated the fault management
> software by disabling it, so I suspect another issue perhaps.
> But the 90 minutes is ominous - that is around the time it
> takes to scrub 2GB at the default scrub rate we set in Solaris
> (which works out to around 1GB every 45 minutes). Some BIOS
> offer both forms of remapping the hole even on rev E and later
> parts.
>
> Could you do the following for me, please:
>
> 0) In your BIOS options look for an option to remap/reclaim
>    the dram hole, and if it is offering a choice of "software"
>    vs "hardware" remapping select hardware. If that was
>    not already selected reboot with that setting and see
>    if the hang occurs. Set the following in /etc/system
>    to increase the scrub rate we apply so (if we're guilty)
>    you won't have to wait 90 minutes each time:
>
>    set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=1
>
> 1) Regardless of the results of 0, in mdb dump out the memory
>    controller nvlist info and then the full memory controller structure:
>
>    mdb -k <<EOM
>    *mc_list::list mc_t mc_next | ::print mc_t mc_nvl | ::nvlist
>    *mc_list::list mc_t mc_next | ::print mc_t
>    EOM
>
> 2) If 0 did not take care of it, rename the two AMD cpu modules and reboot:
>
>    # mv /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15 \
>         /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15-
>
>    # mv /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15 \
>         /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15-
>
>    # init 6
>
>    That's a bit heavy-handed since it eliminates all the config
>    operations we perform and not just the scrubber stuff. We will
>    fall back to the dumb generic support.
>
> 3) If 2) appears to let you survive longer than 90 minutes you can
>    add the following to /etc/system as a workaround:
>
>    set cpu\.AuthenticAMD\.15:ao_scrub_policy=1
>    set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=0
>
>    which will stop us enabling the dram scrubber. If you
>    set the dram scrub rate in 0) above to 1 be sure to replace
>    that line with the 0 setting above.
>
> In all cases do not leave the scrub rate setting of 1 (maximum rate)
> that you may have set in step 0 - that scrubs at around 1G every
> 0.66s and you *will* notice the performance impact!
>
> Thanks
>
> Gavin

Thanks Gavin,

Following the first reply I tried the workaround mentioned in
http://mail.opensolaris.org/pipermail/opensolaris-bugs/2006-September/000676.html,
which seems to have worked (2:20 hours uptime so far); but I'll follow
your instructions in full and report back with the results first thing
tomorrow.

Cheers,
Chris
_______________________________________________
opensolaris-discuss mailing list
[email protected]
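
For anyone checking the timing argument in the quoted message, here is a minimal sketch of the arithmetic. It uses only the two figures Gavin quotes (about 1GB every 45 minutes at the default rate, about 1G every 0.66s at the maximum rate) and assumes that scrub time scales linearly with the amount of memory. The function names are made up for illustration and do not correspond to anything in Solaris.

```python
# Scrub-time arithmetic based only on the figures quoted in the thread.
# Assumption: the time for one full scrub pass scales linearly with memory size.

DEFAULT_MINUTES_PER_GB = 45.0   # "around 1GB every 45 minutes" (default rate)
MAX_RATE_SECONDS_PER_GB = 0.66  # "around 1G every 0.66s" (scrub rate setting of 1)


def default_scrub_minutes(memory_gb):
    """Minutes for one pass over memory_gb of DRAM at the default rate."""
    return memory_gb * DEFAULT_MINUTES_PER_GB


def max_rate_scrub_seconds(memory_gb):
    """Seconds for one pass over memory_gb of DRAM at the maximum rate."""
    return memory_gb * MAX_RATE_SECONDS_PER_GB


if __name__ == "__main__":
    for gb in (1, 2, 4):
        print("%d GB: %.0f min at default rate, %.2f s at maximum rate"
              % (gb, default_scrub_minutes(gb), max_rate_scrub_seconds(gb)))
```

With 2GB this gives the 90 minutes mentioned above for the default rate, which is why the hang interval looks suspicious, and it shows why the maximum rate shortens the wait for a reproduction to a second or two per pass.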
