Hi,

On 08/06/07 13:04, Chris Linton-Ford wrote:
Apologies for the delay in replying - our fileserver died a grisly death on Friday afternoon.

Could you do the following for me, please:

0) In your BIOS options look for an option to remap/reclaim the dram hole, and if it is offering a choice of "software" vs "hardware" remapping select hardware. If that was not already selected reboot with that setting and see if the hang occurs. Set the following in /etc/system to increase the scrub rate we apply so (if we're guilty) you won't have to wait 90 minutes each time:

set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=1

No option exists in the BIOS Setup Utility to remap the dram hole, as far as I could see.
OK. It appears you do not have the hole remapped either way, anyway.
Adding the above line to /etc/system did indeed reduce the time to crash - so much so that it didn't get as far as the login prompt :( Hence, as I didn't have the nous or time to roll the Smart Array drivers into a Live CD or the failsafe boot environment, I reinstalled.
Sorry, I should have listed a few ways out of that hole. My bad.
1) Regardless of the results of 0, in mdb dump out the memory controller nvlist info and then the full memory controller structure:

mdb -k <<EOM
*mc_list::list mc_t mc_next | ::print mc_t mc_nvl | ::nvlist
*mc_list::list mc_t mc_next | ::print mc_t
EOM

The output is pretty long: instead of attaching it I've put it here: http://chrislf.freeshell.org/mdb.out
Thanks for this. You have revision E cpus in two Socket 940 sockets (Opteron). On each node there are 4 dimms, each dimm being single-rank and each being 512M in size. On each chip the dimms are arranged into two 1G chip-selects of 128-bit width. There is no node interleaving, and the chip-selects of each node are configured in a two-way chip-select interleave.

On both nodes the dram hole size reflected in the dram hole register is zero, so there is no hardware hole reclaim. The chip-selects are interleaved so there is no possibility of discontiguous chip-selects (erratum #99 does not apply). Indeed with 2G per node there is nothing to reclaim - it's only when the installed memory on a node approaches 4G that you overlap with the MMIO area and lose access to the dram that overlaps MMIO.

What stands out here, as it did in the Sun-internal case I had a look at, is that "bank swizzling" is enabled on this system. I believe there is no BIOS option to disable it. Bank swizzling is an AMD mode in which some row bits are interchanged with SDRAM internal bank-select bits to change the physical order in which sdram bits are traversed as physical addresses increase (this can be a performance win for some access patterns). That stands out only because I have never seen it enabled on any lab system, and the only two systems I've seen it enabled on are these two (yours + internally reported) systems reporting hangs involving the dram scrubber.

This leads me to suspect that there may be an issue involving the dram scrubber in the presence of bank swizzling. The last I heard from the internal case the HP BIOS team were going to experiment. I've just asked for an update.

Could you try one more experiment, please? Leave the Solaris /etc/system setting below to stop us enabling the scrubber, but look for the BIOS option (if any) to enable dram scrubbing. Note what it is set to before changing it, and then try enabling the scrubber from there and see if the OS will run.
The BIOS option to enable dram scrubbing sometimes hides within an option named something like "ECC protection" which you can set from "none" to "good", "better", etc. - higher levels enable the scrubbers at higher rates. If Solaris runs ok with the dram scrubber enabled via BIOS (I think the hang will likely still occur) then this may be a case of starting the scrubber incorrectly from Solaris. Actually all we do is set it going at the node dram base address, so there is little to get wrong!
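
For reference, the per-node layout arithmetic described above can be sketched in a few lines of shell. The dimm count and size are the figures read from the mdb output; that a 128-bit chip-select is built from a pair of 64-bit dimms is my assumption here, and the function name is made up for illustration.

```shell
# Figures from the memory controller dump discussed above.
dimms_per_node=4
dimm_mb=512

# Assumption: a 128-bit wide chip-select pairs two 64-bit dimms.
dimms_per_cs=2

# Hypothetical helper: print the per-node totals and the distance to 4G.
layout_summary() {
    node_mb=$((dimms_per_node * dimm_mb))
    cs_mb=$((dimms_per_cs * dimm_mb))
    cs_per_node=$((dimms_per_node / dimms_per_cs))
    echo "per node: ${node_mb}M in ${cs_per_node} chip-selects of ${cs_mb}M"
    # Memory on a node only starts to overlap MMIO as it approaches 4G.
    echo "headroom below 4G: $((4096 - node_mb))M"
}

layout_summary
```

With 2G installed per node the headroom is 2G, which is consistent with the dram hole register reading zero.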
2) If 0 did not take care of it, rename the two AMD cpu modules and reboot:

# mv /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15 \
     /platform/i86pc/kernel/cpu/cpu.AuthenticAMD.15-
# mv /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15 \
     /platform/i86pc/kernel/cpu/amd64/cpu.AuthenticAMD.15-
# init 6

This worked.
OK.
That's a bit heavy-handed since it eliminates all the config operations we perform and not just the scrubber stuff. We will fall back to the dumb generic support.

3) If 2) appears to let you survive longer than 90 minutes you can add the following to /etc/system as a workaround:

set cpu\.AuthenticAMD\.15:ao_scrub_policy=1
set cpu\.AuthenticAMD\.15:ao_scrub_rate_dram=0

which will stop us enabling the dram scrubber. If you set the dram scrub rate in 0) above to 1 be sure to replace that line with the 0 setting above.

This also worked fine! Let me know if you would like further details of my hardware setup.
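
Since a leftover ao_scrub_rate_dram=1 line from step 0) would defeat the workaround, a quick check of the file before rebooting may help. This is only a sketch: the two function names are hypothetical helpers, not anything shipped with Solaris, and the check is deliberately strict (it fails if any active rate line is non-zero) so it does not depend on how duplicate settings are resolved.

```shell
# Hypothetical helper: list the active ao_scrub_* settings in an
# /etc/system style file. Lines starting with '*' or '#' are treated
# as comments and skipped.
show_scrub_settings() {
    file=${1:-/etc/system}
    grep 'ao_scrub_' "$file" | grep -v '^[*#]'
}

# Hypothetical helper: succeed only if at least one active
# ao_scrub_rate_dram line exists and every such line sets it to 0.
dram_scrubber_disabled() {
    lines=$(show_scrub_settings "$1" | grep 'ao_scrub_rate_dram=' || true)
    [ -n "$lines" ] || return 1
    # Any line that does not end in "=0" means a non-zero rate is present.
    if printf '%s\n' "$lines" | grep -v 'ao_scrub_rate_dram=0$' >/dev/null; then
        return 1
    fi
    return 0
}
```

Usage would be along the lines of: dram_scrubber_disabled /etc/system && echo "scrubber workaround in place"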
So for now I'd suggest running with the cpu.AuthenticAMD.15 modules in place, and with the above /etc/system setting.

Thanks

Gavin
_______________________________________________ opensolaris-discuss mailing list [email protected]
