scott.marlowe wrote:

On Tue, 30 Mar 2004, Andrew Biagioni wrote:


Alex,

The answer is "no" to all of these. We are a tiny start-up (two guys, and we do our own cleaning). Ambient temperature varies significantly but is not related to the failures; one machine starts beeping when it gets too hot (we then added an extra case fan). No fancy watchdogs (maybe someday... one can only dream :-> ). The three machines have different cases, power supplies, motherboards, etc. (one power supply is extra-large, and that's the machine that started failing first!).

We originally blamed the problem on hardware failure (first machine); then on OS version/configuration (second machine); now we're out of things to blame, except maybe unusually bad luck...


What did memtest86 say?

Did the same person build all the machines? I've seen plenty of folks build machines and zap the memory when installing it. >95% of all ESD failures are partial / delayed failures, so just because a computer boots up doesn't mean proper ESD procedures were followed, and if not, and if you're in a dry environment like I am (I live in Denver) then it's quite possible all three have bad CPU/mobo/memory or something like that.

Two different people built the machines; we're both electrical engineers with plenty of familiarity and experience with static issues, so that particular issue is not likely.


As for memtest86 - I haven't been able to run it on two of the machines yet (they are in production), and I have to restart the third one (it was "retired" after the third time it died on us).

Meanwhile I found out some more details:
- the first machine had a software RAID setup that may have been unreliable
- the second machine had a much older kernel and sloppily-updated modules, and it would hang -- not reboot
- the last machine to reboot MAY have been a line power issue (the whole building lost power a few hours later, so I lost some info on other machines' restarting -- I'll dig more).


So -- it's memtest86 and badblocks for all three (as soon as I can), better UPS-ing, updated kernel(s), and checking more machines' logs; then we'll see...
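
For the log-checking step, here is a minimal sketch (not from the original message) of one way to look for unexplained silences in a syslog-style file, which can hint at when a machine hung or restarted. It assumes traditional "Mar 30 14:02:11 host ..." timestamps at the start of each line; the function name `find_log_gaps`, the ten-minute threshold, and the sample lines are all hypothetical. A quiet stretch is only a hint, since an idle machine can also log nothing for a while.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where the log was silent for at least min_gap.

    Traditional syslog timestamps carry no year, so the caller supplies one.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            # The first 15 characters hold the timestamp, e.g. "Mar 30 14:02:11".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a syslog timestamp
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


# Hypothetical sample lines, for illustration only.
sample = [
    "Mar 30 14:02:11 db1 kernel: example message",
    "Mar 30 14:03:00 db1 kernel: example message",
    "Mar 30 16:45:09 db1 kernel: example message",
]
for last_seen, next_seen in find_log_gaps(sample, 2004):
    print(f"silent from {last_seen} to {next_seen}")
```

Running the same function over the logs of several machines and comparing the gap times would show whether they went quiet together (pointing at line power) or independently.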

Thanks to you all for the suggestions -- keep them coming!

Andrew

