> But that does not have to be a SPoF for the entire system! The problem here
> is that a single failure (power loss) causes not only one node to
> go down (and the pdu itself, yes), but the whole system stops working
> properly. Now you have to say that one has to equip the pdus with
> redundant power supplies. Unfortunately I know of no such device. Which
> brings me to the conclusion that nobody has yet developed a device that works
> as a fully supported and recommended stonith device. Which is kind of a
> dilemma.

Actually, I believe that the different vendor implementations of "lights out" systems (DRAC, HP/Compaq ILO, various others) *do* support that in various ways. Dell's RAC has a battery that lasts for up to 30 minutes, last time I read its specs.

Regardless, with a "lights out" card watching the server, you have two paths to positively query the status of a node at the node itself, which is enough to be 90% sure it's dead.

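As a rough sketch of that two-path check, assuming one path is the ordinary cluster heartbeat and the other is a power-state query answered by the lights-out card: the names below (Verdict, classify, heartbeat_ok, card_power_on) are made up for illustration and are not any vendor's or Linux-HA's API.

```python
from enum import Enum


class Verdict(Enum):
    ALIVE = "alive"
    DEAD = "dead"
    UNREACHABLE = "unreachable"  # powered on but silent, e.g. NIC failure or hang
    UNKNOWN = "unknown"          # neither path gave an answer


def classify(heartbeat_ok, card_power_on):
    """Combine the two paths into one verdict (hypothetical helper).

    heartbeat_ok:  True if the node answers over the cluster network.
    card_power_on: True or False as reported by the lights-out card,
                   or None if the card itself could not be reached.
    """
    if heartbeat_ok:
        return Verdict.ALIVE
    if card_power_on is None:
        # Both paths are silent: nothing positive is known about the node.
        return Verdict.UNKNOWN
    if card_power_on:
        # The card says the box has power, so the node may only be cut off.
        return Verdict.UNREACHABLE
    return Verdict.DEAD
```

Only the last case, a missing heartbeat plus a card that positively reports the power as off, is treated as dead here; the other cases still need fencing or a human.
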
The switched PDU devices in question, generally made by APC, have some instabilities and, well, 'difficulties' in their implementations that are not well documented or intuitive. Some models don't interoperate well with other models in a mixed environment. And there's no positive feedback from the node itself; you still don't know if the server's dead or just unreachable due to a NIC failure. Checking the power state of the ports you THINK the node is plugged into isn't bad, but if the PDU is dead or your well-meaning coworker changed the placement of the plugs, well...

A decent design with DRAC is to have two switches. For the nodes that are on Switch A, put their DRAC interfaces on Switch B, and vice versa. Switch A and B should have separate battery backups; APC does make 'dumb' hot-fail power switches that work reliably.

-K
