I have about 50 SC1435 servers in an HPC cluster. Over the last several weeks, I've been coming in to work in the morning to find one of them dead from either a "CPUx thermal tripped" or a "CPUx voltage sensor" problem. I have to power the machine back on (sometimes I have to unplug it before the power button will work) and view the SEL to see what happened; there is nothing in the Linux system logs. Once powered back on, the machine boots and runs normally.
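For reference, a minimal sketch of how the SEL entries could be pulled and tallied per node instead of being checked by hand after each failure. It assumes `ipmitool` is installed and can reach each BMC (locally or over the LAN interface), that `ipmitool sel elist` prints pipe-delimited lines with the sensor name in the fourth field and the event text in the fifth, and that Python 3.7 or later is available on the machine running it. The function names are hypothetical and the credentials are placeholders.

```python
import subprocess
from collections import Counter


def read_sel(host=None, user=None, password=None):
    """Return the SEL as a list of text lines, via `ipmitool sel elist`.

    With host=None the local BMC is queried; otherwise the LAN interface
    is used with the supplied (placeholder) credentials.
    """
    cmd = ["ipmitool"]
    if host is not None:
        cmd += ["-I", "lan", "-H", host, "-U", user, "-P", password]
    cmd += ["sel", "elist"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into stripped fields.

    Returns None for blank or unexpectedly short lines.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 4:
        return None
    return fields


def tally_events(lines):
    """Count SEL entries by (sensor, event) pair.

    Assumed field layout: id | date | time | sensor | event | direction.
    """
    counts = Counter()
    for line in lines:
        fields = parse_sel_line(line)
        if fields is None:
            continue
        sensor = fields[3]
        event = fields[4] if len(fields) > 4 else ""
        counts[(sensor, event)] += 1
    return counts
```

Calling `tally_events(read_sel())` on each node, or `tally_events(read_sel(host, user, password))` from a head node, would give a per-node count of thermal and voltage events that could be compared across the affected machines.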
I've had this happen across 9 different machines, so I'm thinking this is not just a simple case of flaky hardware. The nodes run CentOS 5. The cluster jobs push the CPUs hard, so these machines run hot. These failures are getting old, since the BIOS-induced "power off" crashes whatever jobs are running on the node at the time.

I've asked my technical/sales guy to look into whether there was perhaps a bad batch of boards, but he can't find anything. I emailed a ticket to Dell, but they want me to call their HPC group (I'm not thrilled with the prospect of staying on the phone for hours while someone tells me to load and run "dset" on all my nodes...).

Does this issue ring a bell with anybody?

---
Cris

--
Cristopher J. Rhea
Mayo Clinic - Research Computing Facility
200 First St SW, Rochester, MN 55905
[email protected]
(507) 284-0587
_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
