Over the past month I have learned (the hard way) that many PCs are not
entirely reliable.  Of course I knew this, but nothing drives it home like
experience.  Evidently the memory access patterns and access frequency of
typical use let OSes and applications work remarkably well on unreliable
hardware.

I would suggest running memtest86 for at least several days, and at least
several hundred iterations, before assuming that your problems are due to bugs
in OpenSolaris.  I have seen errors that pop up after a random time, of
course, but on the order of 3 or 4 days.  I gave mine 10 days at 10%
overclocking before I declared it good.  If this seems intolerably time
consuming, consider the time wasted chasing ghosts, or dealing with lost data.
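
To put a rough number on why a short test run is not enough, here is a
minimal sketch.  It ASSUMES the errors arrive at random with a constant
average rate (an exponential model), which is my simplification and not
something I measured.  The 3.5 day mean is just the middle of the "3 or 4
days" above, and the function name is made up for illustration.

```python
import math


def detection_probability(test_days: float, mean_days_to_error: float) -> float:
    """Chance of seeing at least one error during a test run.

    Assumes errors arrive at random with a constant average rate
    (exponential model). This is an illustrative assumption only.
    """
    if test_days < 0 or mean_days_to_error <= 0:
        raise ValueError("test_days must be >= 0 and mean_days_to_error > 0")
    return 1.0 - math.exp(-test_days / mean_days_to_error)


if __name__ == "__main__":
    # Hypothetical mean time to first error: 3.5 days.
    for days in (1, 3, 10):
        p = detection_probability(days, 3.5)
        print(f"{days:>2} day(s) of testing: {p:.0%} chance of catching the fault")
```

Under that assumption an overnight run catches such a fault only about a
quarter of the time, while 10 days catches it roughly 94% of the time.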

If your hardware is capable, preferably test it overclocked by 5 or 10%, so
that when you set it back to the spec'ed speed you can have confidence that
you have timing margin.  Or underclock it by 5 or 10% after testing at the
spec'ed speed.  Losing 10% is not going to kill anybody, but being flaky is.
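
The arithmetic behind the margin can be sketched as follows.  This is only
an illustration: the 200 MHz figure is a hypothetical spec clock, and the
function name is my own.  The idea is that if the machine passes at the
faster test clock, every cycle at the spec clock is longer than the one it
was proven to work with.

```python
def timing_margin(spec_mhz: float, test_mhz: float) -> float:
    """Fraction of the spec clock period that is slack, given that the
    system passed its tests at test_mhz.

    period = 1 / frequency, so the slack is
    (1/spec - 1/test) / (1/spec) = 1 - spec/test.
    """
    if spec_mhz <= 0 or test_mhz <= 0:
        raise ValueError("clock rates must be positive")
    return 1.0 - spec_mhz / test_mhz


if __name__ == "__main__":
    spec = 200.0  # hypothetical spec'ed clock in MHz
    for overclock in (0.05, 0.10):
        margin = timing_margin(spec, spec * (1 + overclock))
        print(f"tested at +{overclock:.0%}: {margin:.1%} of each cycle is slack at spec")
```

So a 5% overclock buys about 4.8% of slack per cycle at the spec'ed speed,
and a 10% overclock about 9.1%.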

When you adjust the clock rate between your test condition and your run
condition, ideally it would be ALL clocks:  CPU, FSB, PCI, Memory, etc.  All
need to have timing margin so as to not run on the hairy edge.  If you cannot
adjust your clock rates, perhaps you can set the memory timings in manual
mode:  run them at their minimums for the test (e.g. 2 cycles) and run
normally at 3.  The processor cache will mitigate the effect on performance.

If you do all this, you will have a system that runs 5 or 10% slower but that 
you know you can depend on.

Note that you may also find that the unreliability is not related to clock
speed.  In this case I don't have any advice, but you can at least be aware to
look to your hardware, not to OpenSolaris.  Also be aware that one can wear out
sockets by swapping and rearranging parts while trying to isolate the root
cause.  You get a small number of insertions with that nice tight crunch, then
perhaps a dozen or two more before you are approaching the danger zone.

I suggest that a note to this effect should be "sticky'ed" to the top of the
"help" list with a link to memtest86.  I tried the two mainline versions and a
couple of off-shoot versions, and all of them gave the same result.  Maybe it
should be part of the distribution CD, as the default boot option, as a
not-so-subtle hint!

Thank you all for all of your patience and advice through this.  I have tried
to tag the appropriate threads where I was using flaky hardware.  I
specifically want to note that my threads in early December regarding ZFS were
with what I now have proof IS RELIABLE hardware.  Those were diagnosed
correctly in that huge thread.

--Ray
-- 
This message posted from opensolaris.org
