I assume you have ruled out some kind of marginal power situation in the machine room?

> -----Original Message-----
> From: Cris Rhea [mailto:[email protected]]
> Sent: Monday, April 19, 2010 12:56 PM
> To: [email protected]
> Subject: Thermal issues with SC1435 servers??
>
> I have a bunch (~50) SC1435 servers as part of an HPC cluster.
>
> Over the last several weeks, I'll come to work in the morning to find one
> of them dead from either a "CPUx thermal tripped" or "CPUx voltage sensor"
> problem. I'll have to power them back on (or sometimes, unplug them before
> the power button will work) and view the SEL to see what happened (nothing
> in the Linux system logs). Once powered back on, they boot/run normally.
>
> I've had this happen across 9 different machines, so I'm thinking this
> is not just a simple case of flakey hardware.
>
> Running CentOS 5 as part of an HPC environment. The cluster jobs push
> the CPUs, so these machines run hot. These failures are getting old as
> they crash the jobs on them at the time of the BIOS-induced "power off".
>
> I've asked my technical/sales guy to look into this to see if there was
> perhaps a bad batch of boards, but he can't find anything.
> I emailed a ticket to Dell, but they want me to call their HPC group
> (not thrilled with the prospect of staying on the phone for hours while
> someone tells me to load/run "dset" on all my nodes...)
>
> Does this issue ring a bell with anybody?
>
> --- Cris
>
> --
> Cristopher J. Rhea
> Mayo Clinic - Research Computing Facility
> 200 First St SW, Rochester, MN 55905
> [email protected]
> (507) 284-0587
>
> _______________________________________________
> Linux-PowerEdge mailing list
> [email protected]
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq
