On Tue, 2010-04-06 at 14:58 -0700, Tony Gan wrote:

> I think the solution is using UPS or PDU for STONITH device.

That could improve things in some scenarios, but it does not completely
solve the problem. The cluster is still vulnerable to having the entire
power strip for one node unplugged or turned off. No matter what the
stonith device is, there is always the possibility of failure of the
stonith device itself. My goal is to be able to recover from something
like this remotely, before I can actually get there to correct the real
problem.

In fact, the chance that stonith will fail because one of the nodes has
completely lost power due to hardware failure while the other one still
has power is extremely small. They both have dual power supplies and
they both use the same two circuits, so the only way I am at all likely
to get into the state I am concerned about is human error. Unfortunately
we do have a lot of people with machine room access, so someone powering
off the wrong machine by mistake is a real possibility. The chance that
two power supplies would fail at the same time is remote. Human error is
just as possible with a controllable power strip as the stonith device,
so that doesn't really solve my problem.

I do think I found something that might work. I'm not sure yet, but it
looks like I can create a stonith:meatware resource in addition to the
stonith:ipmilan resource. That would allow me to manually confirm that
the powerless node is in fact dead and have the remaining node take
over. That confirmation can be done by logging in to the live node
remotely, so it will serve my needs if I can figure out the magic
incantation to configure this correctly.
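
To make the idea concrete, here is a toy Python model of the behaviour I
am hoping for. It is not cluster code and says nothing about how the
real configuration has to look: the fence() function, the device
callbacks and the node name "node2" are all made up for illustration. It
only shows the fallback I have in mind, where IPMI is tried first and a
manual confirmation is accepted when IPMI cannot answer.

```python
from typing import Callable, List, Set, Tuple

# A fencing device is modelled as (name, callback); the callback returns
# True only if the device can confirm that the node is really down.
FenceDevice = Tuple[str, Callable[[str], bool]]


def fence(node: str, devices: List[FenceDevice]) -> str:
    """Try each device in order; return the name of the first that succeeds."""
    for name, try_fence in devices:
        if try_fence(node):
            return name
    # Nothing could confirm the node is dead, so takeover is not safe yet.
    raise RuntimeError(f"could not fence {node}; takeover is blocked")


def ipmi_unreachable(node: str) -> bool:
    # Models a node whose power was pulled: its BMC is dark as well,
    # so the IPMI device cannot confirm anything.
    return False


def make_meatware(confirmed: Set[str]) -> Callable[[str], bool]:
    # Models the manual device: it succeeds only for nodes that a human
    # has explicitly confirmed to be down.
    return lambda node: node in confirmed


confirmed_dead: Set[str] = set()
devices = [
    ("ipmilan", ipmi_unreachable),
    ("meatware", make_meatware(confirmed_dead)),
]

try:
    fence("node2", devices)          # no confirmation yet, so this fails
except RuntimeError as err:
    print(err)

confirmed_dead.add("node2")          # operator confirms the node is dead
print(fence("node2", devices))       # now the manual device succeeds
```

In the real setup the "confirmation" step would be me logging in to the
live node remotely and acknowledging by hand that the other node is
dead.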

--Greg


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
