> But that does not have to be a SPoF for the entire system!  The problem here
> is that a single failure (power loss) causes not only one node to
> go down (and the pdu itself, yes), but the whole system stops working
> properly.  Now you have to say that one has to equip the pdus with
> redundant power supplies.  Unfortunately I know of no such device.  Which
> brings me to the conclusion that nobody has yet developed a device that works
> as a fully supported and recommended stonith device.  Which is kind of a
> dilemma.
 
Actually, I believe that the different vendor implementations of "lights out"
systems (DRAC, HP/Compaq ILO, various others) *do* support that in one way or
another. Dell's RAC has a battery that lasts for up to 30 minutes, going by
its specs the last time I read them. Regardless, with a "lights out" card
watching the server, you have two paths to positively query the status of a
node at the node itself, which is enough to be 90% sure it's dead.
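
To make the "two paths" idea concrete, here is a rough sketch of how the two
answers could be combined. Everything in it is made up for illustration:
`classify_node`, the `Verdict` names, and the 'on'/'off' strings are my own
assumptions, not part of heartbeat or of any vendor's tooling, and how you
actually obtain the two inputs is left out entirely.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ALIVE = "alive"              # cluster path answers
    DEAD = "dead"                # cluster path silent, lights-out card reports power off
    UNREACHABLE = "unreachable"  # cluster path silent, but the card reports power on
    UNKNOWN = "unknown"          # neither path gave an answer


def classify_node(heartbeat_ok: bool, bmc_power: Optional[str]) -> Verdict:
    """Combine two independent status paths into one verdict.

    heartbeat_ok: result of the normal cluster/network check (hypothetical input).
    bmc_power:    'on', 'off', or None if the lights-out card did not respond
                  (hypothetical values, not a vendor API).
    """
    if heartbeat_ok:
        return Verdict.ALIVE
    if bmc_power == "off":
        return Verdict.DEAD
    if bmc_power == "on":
        return Verdict.UNREACHABLE
    return Verdict.UNKNOWN
```

The case that matters is UNREACHABLE: the node has power but can't be reached
over the normal path, which is the NIC-failure situation a PDU alone can't
tell you about.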

The switched PDU devices in question, generally made by APC, have some
instabilities and, well, 'difficulties' in their implementations that are not
well-documented or intuitive. Some models don't interoperate well with other
models in a mixed environment. And there's no positive feedback from the node
itself; you still don't know if the server's dead or just unreachable due to a
NIC failure. Checking that the outlets you THINK the node is plugged into are
powered off isn't bad, but if the PDU is dead or your well-meaning coworker
changed the placement of the plugs, well...

A decent design with DRAC is to have two switches. For the nodes whose network
connections are on Switch A, put the DRAC interfaces on Switch B, and vice
versa. Switch A and B should have separate battery backups; APC does make
'dumb' hot-fail power switches that work reliably.
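
If you keep the cabling written down somewhere, a few lines are enough to
catch a node that ended up with both paths on the same switch. This is only a
sketch; the `layout` table, the node names, and `shared_switch_nodes` are
invented for the example.

```python
from typing import Dict, List, Tuple


def shared_switch_nodes(layout: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return nodes whose data NIC and DRAC interface share a switch.

    layout maps node name -> (switch for the data NIC, switch for the DRAC).
    Any node returned here loses both status paths if that one switch dies.
    """
    return sorted(
        node
        for node, (data_switch, drac_switch) in layout.items()
        if data_switch == drac_switch
    )


# Hypothetical cabling table: node3 is wired wrong on purpose.
layout = {
    "node1": ("A", "B"),
    "node2": ("B", "A"),
    "node3": ("A", "A"),
}

print(shared_switch_nodes(layout))  # ['node3']
```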

-K 
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
