On Wed, 13 Jan 1999, Robert G. Brown wrote:
> On Wed, 13 Jan 1999, Jean SCHMITTBUHL wrote:
>
> > Hello,
> >
> > I read with great interest your emails on the SMP list about the
> > problems you had with your dual 450 MHz PII. I bought a dual 450 MHz PII
> > last October and I have problems that look very similar to yours. My
> > machine reboots spontaneously when it is loaded by some jobs. However, it is
> > perfectly stable with little or no load.
> > My configuration is:
> > 2x450 MHz PII
> > Asus P2BDS MB
> > 512 MB
> > Viking SCSI disk, 4 GB
> > ATI video card
> > 3c900 ethernet card
> >
> > When I underclock the bus frequency to 83 MHz instead of 100 MHz it works
> > perfectly. Swapping the processors at 100 MHz does not change the
> > instability problem. Did you try to underclock yours? Did you solve your
> > stability problem?
>
> I should answer this briefly for the whole group, since the answer is
> yes. We yanked the expensive 256 MB SDRAM DIMMs we got from the vendor
> ("registered" PC100 and all) and replaced them with over-the-counter 128
> MB SDRAM PC100 DIMMs from a local vendor and have never looked back.
> Totally stable. From some of the suggestions I got, it may be that the
> P6DBS has trouble with 256 MB DIMMS, or it may be that the memory we got
> was incidentally crap, but I don't have evidence addressing that point
> as we decided to just add nice cheap 128 MB DIMMS as our jobs demanded
> them rather than load up to 512 MB a priori anyway.
Do you have a local dealer with a SIMM/DIMM tester? We ran into problems
like this last year. Dropping a 60 ns DIMM into the tester would get "safe
at 73 ns" and nonsense like that, and that obviously won't work when the
timing is tight. We use Kingston and don't have problems.
>
> We tried a whole range of things suggested by the group, however, and
> all put together they form a veritable hardware debugging manual that is
> well worth adding to the smp FAQ. Something like:
>
> a) Removal of all cards but an SVGA card and testing to failure.
> Swapping the SVGA card with a completely different one and testing to
> failure. (If it stabilizes on this step, add back cards until point of
> failure is found and deal with it by replacement or hacking the driver
> or whatever).
>
> b) Setting the motherboard speed (and hence both CPU and memory
> speed) down to the next accessible quantum, in our case 300 MHz. In our
> case, this stabilized the system! This strongly suggested a problem
> with the CPUs (possible forgery? possible overclocks?), the memory (bad
> memory) or the motherboard itself (yuk! the one part that is really hard
> to swap out).
Changing clock settings like this is becoming more difficult to do, and
will likely be totally impossible within another year or two. Intel is
cracking down on overclocking by locking the processor speeds, and they
are rumored to be planning to lock the bus speed as well, so that
overclocking isn't going to happen.
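
For anyone trying step b), here is a rough sketch of the clock arithmetic
involved. It assumes a 4.5x multiplier (450 MHz core on a 100 MHz bus); the
function name is made up for illustration, and the bus values are the
nominal 100/83/66 MHz settings.

```python
def core_clock_mhz(bus_mhz: float, multiplier: float) -> float:
    """Core clock is the bus clock times the CPU multiplier."""
    return bus_mhz * multiplier


# Assumption: a 450 MHz PII on a 100 MHz bus implies a 4.5x multiplier.
MULTIPLIER = 4.5

for bus in (100.0, 83.0, 66.0):
    core = core_clock_mhz(bus, MULTIPLIER)
    print(f"bus {bus:5.1f} MHz -> core {core:6.1f} MHz")
```

Since the memory runs off the same bus clock, dropping the bus slows both
the CPU and the DIMMs, which is why a stable underclocked system does not
by itself tell you which of the two is marginal.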
>
> c) Removal of one CPU (back at full speed), then the other and
> testing to failure. Removal of all the memory but one DIMM, testing to
> failure. Swapping for the other DIMM(s) and testing to failure (binary
> search optional if 3 or more DIMMS). None of this stabilized our
> system. We checked the CPUs carefully for forgery (apparently a
> significant problem, see list archives) but Intel verified their serial
> numbers. The problem was almost certainly in the motherboard itself or
> the memory subsystem, but not in either individual DIMM.
>
> d) Finally, we bought the aforementioned 128 MB DIMMs, swapped them in,
> and the system was stable with dual CPUs at full speed. Sent back
> (expensive!) 256 MB DIMMs to vendor (Aberdeen, Inc.) with irate letter. Long ago vowed
> never to do business with Aberdeen again for other reasons (like the
> utter impossibility of getting service or even attention from them
> unless we write the president of the company PERSONALLY -- fortunately I
> do indeed have his email address;-), this reinforced the decision.
>
> e) Don't know what stage to put this, but I've found putting an
> actual multimeter on the power supply to be helpful in the past, as well
> as a careful check of its rated peak current/power. Some systems,
> especially ATX systems, won't start unless the power supply can provide
> enough current at startup, and vendors sometimes load in the cards or
> peripherals after burning in the motherboard (idiots!) and don't realize
> that a system built with a cheap PS won't boot. In our case, I was
> running the lm-sensors package and already knew the core voltages and
> temperatures to be in the nominal range.
>
A 'scope is better. You want a smooth output, not something with a
lot of 60 Hz ripple scattered everywhere. We had a supply like that in a Sun
a couple of years ago and it caused random and intermittent failures that
were a pain to find.
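
To illustrate why a meter can miss this, here is a small sketch with
simulated data. The 5 V rail, the 200 mV ripple, and the 50 mV limit are
all assumptions for illustration only; check the actual limit against the
supply's datasheet. A DC meter reading is roughly the average, which looks
fine, while the peak-to-peak figure a 'scope shows does not.

```python
import math


def ripple_peak_to_peak(samples):
    """Peak-to-peak ripple of a sampled rail voltage, in volts."""
    return max(samples) - min(samples)


def simulated_rail(nominal_v, ripple_v_pp, hz=60, sample_rate=6000, n=600):
    """Hypothetical test signal: DC level plus sinusoidal ripple."""
    amp = ripple_v_pp / 2.0
    return [
        nominal_v + amp * math.sin(2 * math.pi * hz * i / sample_rate)
        for i in range(n)
    ]


MAX_RIPPLE_V = 0.05  # assumed limit, for illustration only

samples = simulated_rail(5.0, 0.2)
average = sum(samples) / len(samples)   # roughly what a DC meter reports
pp = ripple_peak_to_peak(samples)       # what a 'scope would show

print(f"average {average:.3f} V, ripple {pp * 1000:.0f} mV p-p")
print("ripple out of limit" if pp > MAX_RIPPLE_V else "ripple ok")
```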
> f) Don't know what stage to put this either, as the SMP kernels are
> pretty reliable these days, but it is always a good idea to try SMP and
> UP kernels, and to build "minimal device" kernels and try them to see if
> the system stabilizes as well to eliminate software as a source of
> difficulty. These days I find hardware MUCH more likely to be the point
> of failure if it is something mysterious. Problems with device drivers
> are usually fairly obvious and fixed by the time you get a single clean
> boot.
>
> SO, I don't want to suggest that your problem is certainly memory or
> anything like that, but the protocol above might be useful to you and
> anyone else. Perhaps it could get added to the linux-SMP FAQ?
>
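
The "binary search" over DIMMs mentioned in step c) can be sketched as
below. This is only an illustration: is_stable() is a hypothetical callback
standing in for "install just these parts, run the load test, see if the
box survives", the DIMM labels are made up, and the search assumes exactly
one bad part among those listed.

```python
def find_bad_part(parts, is_stable):
    """Bisect a list of parts, assuming exactly one of them is bad.

    is_stable(subset) is a hypothetical callback: it should return True
    if the machine survives the load test with only `subset` installed.
    """
    candidates = list(parts)
    if not candidates:
        raise ValueError("need at least one part to test")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if is_stable(first_half):
            # First half is clean, so the bad part is in the second half.
            candidates = candidates[mid:]
        else:
            candidates = first_half
    return candidates[0]


# Simulated example with made-up labels; "DIMM3" plays the bad part.
dimms = ["DIMM1", "DIMM2", "DIMM3", "DIMM4"]
suspect = find_bad_part(dimms, lambda subset: "DIMM3" not in subset)
print("suspect:", suspect)
```

If every subset fails, as happened in step c), the one-bad-part assumption
does not hold and the result means nothing; the fault is then more likely
in something common to all configurations.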
I think one good point is to always start with 'quality' parts. For example,
we bought a bunch of Sun SuperSPARCs last year. We have had three fail
completely already, all three from memory problems. All three had
memory by (I think) samson. When we have tried PCs (Pentium Pros and
up) in heavy-duty applications like file servers,
we have had problems with off-brand memory being way out of spec. That is,
just because it says PC100 doesn't mean it works at PC100 speeds, and
just because it says 60 ns on the actual chips doesn't mean it will
operate safely and reliably at 60 ns. We've been considering buying a tester
ourselves, since we have so many machines scattered around.
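
If you do get tester readings, the check itself is trivial. A sketch,
assuming the tester reports a "safe at N ns" figure per module: the 60/73
pair is the case mentioned earlier, while the second entry and both labels
are made up for illustration.

```python
def out_of_spec(readings):
    """Return (label, shortfall_ns) for modules slower than their rating.

    readings: list of (label, rated_ns, measured_safe_ns) tuples.
    """
    return [
        (label, measured - rated)
        for label, rated, measured in readings
        if measured > rated
    ]


readings = [
    ("module A", 60, 73),   # rated 60 ns, tester says safe at 73 ns
    ("module B", 60, 58),   # hypothetical module that meets its rating
]

for label, shortfall in out_of_spec(readings):
    print(f"{label}: {shortfall} ns slower than rated")
```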
> rgb
>
> Robert G. Brown http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567 Fax: 919-660-2525 email:[EMAIL PROTECTED]
>
>
>
> -
> Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
> To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]
>