Hi David

Some of the built-in 3Com Ethernet 100 interfaces on
Tyan S2466[-4M] motherboards we have here became flaky/failed
after many years of use.
Those are main boards in several standalone workstations/PCs.
I don't administer those systems, but I believe the symptoms
were somewhat random, like those you describe.

Disabling the onboard Ethernet (by jumper) and replacing it with
PCI Ethernet 100 cards gave those systems additional lifetime.
Could this be the case with your cluster node?

Interestingly, I also posted a note today asking for help with
Gigabit Ethernet on these very same motherboards!
We also have them in an old workhorse cluster.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Gerald Creager wrote:
David Mathog wrote:
So we have a cluster of Tyan S2466 nodes and one of them has failed in
an odd way. (Yes, these are very old, and they would be gone if we had a
replacement.)  On applying power the system boots normally and gets far
into the boot sequence, sometimes to the login prompt, then it locks up.
If booted failsafe it will stay up for tens of minutes before locking.
It locked once on "man smartctl" and once on "service network start".
However, on the next reboot, it didn't lock with another "man smartctl",
so it isn't like it hit a bad part of the disk and died.  Smartctl test
has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
healthy with no blocks swapped out.  Power stays on when it locks, and
the display remains as it was just before the lock.  When it locks it
will not respond to either the keyboard or the network.  (The network
interface light still flashes.)  There is nothing in any of the logs to
indicate the nature of the problem.

The odd thing is that the system is remarkably stable in some ways.  For
instance, the PS tests good and heat isn't the issue: after running
sensors in a tight loop to a log file, waiting for it to lock up, then
looking at the log on the next failsafe boot, there was negligible
fluctuation on any of the voltages, fan speeds, or temperatures.  It
will happily sit for 30 minutes in the BIOS, or hours running memtest86
(without errors).  The motherboard battery is good, and the inside of
the case is very clean, with no dust visible at all.  I reset the BIOS,
but that didn't change anything.
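
For anyone wanting to repeat the sensor-logging test described above, here is a minimal sketch, assuming Python 3.7+ on the node. It is not the loop that was actually used; the function name log_command and its parameters are made up for illustration. Each sample is flushed and fsync'ed so that the last readings before a hard lock have a better chance of reaching the disk.

```python
import os
import subprocess
import time


def log_command(cmd, log_path, samples, interval=1.0):
    """Append timestamped output of cmd to log_path, syncing after each sample.

    log_command is a hypothetical helper, not part of lm-sensors.
    cmd is an argument list, e.g. ["sensors"].
    """
    with open(log_path, "a") as log:
        for _ in range(samples):
            # Run the command and capture whatever it prints on stdout.
            result = subprocess.run(cmd, capture_output=True, text=True)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"=== {stamp} rc={result.returncode}\n")
            log.write(result.stdout)
            # Push the data out of Python's buffer and ask the kernel to
            # write it to disk before the next sample is taken.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)
```

A call such as log_command(["sensors"], "sensors-loop.log", samples=3600) would take one reading per second for an hour; the log file name is only an example. Note that polling once a second cannot show high-frequency noise, only slow drifts.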

Here are my current hypotheses for what's wrong with this beast:

1. The drive is failing electrically, puts voltage spikes out on some
operations, and these crash the system.
2. The motherboard capacitors are failing and letting too much noise in.
The noise which is fatal is only seen on an active system, so sitting
in the BIOS or in Memtest86 does not do it. (But the caps all look good,
no swelling, no leaks.)  It will run memtest86 overnight though, just in
case.
3. The PS capacitors are failing, so that when loaded there is enough
voltage fluctuation to crash the system.  (Does not agree very well with
the sensors measurements, but it could be really high frequency noise
superimposed on a steady base voltage.)
4. Evil Djinn ;-(

Any thoughts on what else this might be?


I'd also be suspicious of memory failures. We have had DIMM failures that were unseen on repeated MemTest86 runs until they failed hard, hard, HARD. While they were still trying to decide, they'd pass MemTest and we'd try using them.

Capacitor failures are a potential problem but if the systems have been in a stable environment and not subject to a lot of thermal stressors, they should be fine. Especially the power supply caps shouldn't decide to get old and fail (I'm assuming you're talking electrolytics). The old paper electrolytics might have exhibited this behavior, but not even tantalums will do this. And, if tantalum caps go, they tend to be more spectacular and take lots of other parts with them.

More to the point, (ceramic) chip caps that haven't been in a wet, humid, or temperature-varying environment shouldn't crack and fail.

Option 4 has potential, though.

gc
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf