On Tue, 2010-04-06 at 14:58 -0700, Tony Gan wrote:
> I think the solution is using UPS or PDU for STONITH device.
That could improve things in some scenarios, but it does not completely solve the problem. The cluster is still vulnerable to having the entire power strip for one node unplugged or turned off. No matter what the stonith device is, there is always the possibility that the stonith device itself fails. My goal is to be able to recover from something like this remotely, before I can actually get there to correct the real problem.

In fact, the chance that stonith will fail because one of the nodes has completely lost power due to hardware failure while the other one still has power is extremely small. Both nodes have dual power supplies and both use the same two circuits, and the chance that two power supplies would fail at the same time is remote. So the only likely way I could get into the state I am concerned about is human error. Unfortunately, we do have a lot of people with machine room access, which makes someone powering off the wrong machine by mistake a real possibility. Human error is just as possible with a controllable power strip as the stonith device, so that doesn't really solve my problem.

I do think I found something that might work. I'm not sure yet, but it looks like I can create a stonith:meatware resource in addition to the stonith:ipmilan resource. That would allow me to manually confirm that the powerless node is in fact dead and have the remaining node take over. That confirmation can be done by logging in to the live node remotely, so it will serve my needs if I can figure out the magic incantation to configure this correctly.

--Greg
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
