We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
found the system to be quite unstable. After BIOS updates and kernel
changes we still get random kernel panics when under load.
these are older, well-known, widely installed and certainly _can_ run stably.
have you run memtest86? are you monitoring temperatures?
(and perhaps voltages)
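as a rough sketch of the monitoring I mean, assuming a Linux kernel with
a sensor driver loaded that exposes the standard hwmon sysfs files
(temp*_input in millidegrees C, in*_input in millivolts). the function
name read_hwmon is just made up for illustration, and which sensors show
up depends entirely on the board and driver:

```python
import glob
import os


def read_hwmon(root="/sys/class/hwmon"):
    """Return {(chip, sensor): value}; temperatures in deg C, voltages in V.

    Assumes the standard hwmon sysfs layout. Sensors that cannot be
    read or parsed are skipped rather than treated as errors.
    """
    readings = {}
    for chip_dir in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        # Each chip directory normally carries a "name" file; fall back
        # to the directory name if it is missing.
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)

        for pattern in ("temp*_input", "in*_input"):
            for path in sorted(glob.glob(os.path.join(chip_dir, pattern))):
                try:
                    with open(path) as f:
                        raw = int(f.read().strip())
                except (OSError, ValueError):
                    continue
                sensor = os.path.basename(path)[: -len("_input")]
                # sysfs reports milli-units; convert to whole units.
                readings[(chip, sensor)] = raw / 1000.0
    return readings
```

calling read_hwmon() every minute or so and logging the result to another
machine should show whether a panic is preceded by a temperature climb or
a sagging rail.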
So far we have solved the
- broken BIOS problem with an update to the most recent BIOS.
due to a newer cpu? the cluster I have with S2882's (mixed with
S2881's, I think) hasn't needed any updates, but it's not using
dual-core or anything exotic.
- Discovered that some power supplies can produce problems
http://www.anandtech.com/mb/showdoc.aspx?i=2608
I have a hard time believing this is specific to antec+tyan.
yes, certainly, PS's are a sensitive point, especially if you've
got heavily-configured systems.
- FS corruption due to a firmware problem in a RAID hardware board
therefore not related to the MB, right?
- MCE chipkill errors (non-fatal) due to apparent bad RAM
also not related to the MB, right? also, you really should expect
some small rate of corrected ECC's on any system; it's only a high
rate that's a problem (or uncorrectable ones, of course...)
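to put a number on "rate", something like the following could work. it
is only a sketch, and it assumes the kernel's EDAC driver is loaded and
exposes ce_count/ue_count per memory controller under
/sys/devices/system/edac/mc. the helper names read_edac_counts and
ce_per_hour are hypothetical, and what counts as "high" is left to you:

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": int, "ue_count": int}, ...}.

    A counter that is missing or unreadable is reported as None.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for key in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, key)) as f:
                    entry[key] = int(f.read().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts


def ce_per_hour(before, after, elapsed_seconds):
    """Corrected errors per hour for each controller present in both samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for mc, now in after.items():
        then = before.get(mc)
        if then is None:
            continue
        if now["ce_count"] is None or then["ce_count"] is None:
            continue
        delta = now["ce_count"] - then["ce_count"]
        rates[mc] = delta * 3600.0 / elapsed_seconds
    return rates
```

take one sample, wait a known interval under load, take another, and
compare. any increase in ue_count is worth chasing regardless of the
corrected rate.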
To be solved:
- random kernel panics that take out the logging even when all debug
flags are set in the kernel, as it fails to sync the disc during the
kernel panic.
but kernel panics never sync - after all, a panic is specifically
an event from which you can't continue in any way. or am I misunderstanding
what you're saying?
it sounds like you've done a lot of debugging already, but I'd recommend
going back to basics. remove all the io devices, disks, etc and see
whether the board+cpu+memory can run stably, etc.
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf