I assume you've done this but forgot to mention it in the email - did you test the RAM?
-Jack Carrozzo

On Wed, Dec 16, 2009 at 5:27 PM, David Mathog <[email protected]> wrote:
> So we have a cluster of Tyan S2466 nodes and one of them has failed in
> an odd way. (Yes, these are very old, and they would be gone if we had a
> replacement.) On applying power the system boots normally and gets far
> into the boot sequence, sometimes to the login prompt, then it locks up.
> If booted failsafe it will stay up for tens of minutes before locking.
> It locked once on "man smartctl" and once on "service network start".
> However, on the next reboot, it didn't lock with another "man smartctl",
> so it isn't like it hit a bad part of the disk and died. Smartctl test
> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
> healthy with no blocks swapped out. Power stays on when it locks, and
> the display remains as it was just before the lock. When it locks it
> will not respond to either the keyboard or the network. (The network
> interface light still flashes.) There is nothing in any of the logs to
> indicate the nature of the problem.
>
> The odd thing is that the system is remarkably stable in some ways. For
> instance, the PS tests good and heat isn't the issue: after running
> sensors in a tight loop to a log file, waiting for it to lock up, then
> looking at the log on the next failsafe boot, there was negligible
> fluctuation on any of the voltages, fan speeds, or temperatures. It
> will happily sit for 30 minutes in the BIOS, or hours running memtest86
> (without errors). The motherboard battery is good, and the inside of
> the case is very clean, with no dust visible at all. Reset the BIOS but
> it didn't change anything.
>
> Here are my current hypotheses for what's wrong with this beast:
>
> 1. The drive is failing electrically, puts voltage spikes out on some
> operations, and these crash the system.
> 2. The motherboard capacitors are failing and letting too much noise in.
> The noise which is fatal is only seen on an active system, so sitting
> in the BIOS or in Memtest86 does not do it. (But the caps all look good,
> no swelling, no leaks.) It will run memtest86 overnight though, just in
> case.
> 3. The PS capacitors are failing, so that when loaded there is enough
> voltage fluctuation to crash the system. (Does not agree very well with
> the sensors measurements, but it could be really high frequency noise
> superimposed on a steady base voltage.)
> 4. Evil Djinn ;-(
>
> Any thoughts on what else this might be?
>
> Thanks.
>
> David Mathog
> [email protected]
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
