On 29/08/2013, at 4:18 PM, Ulrich Windl <[email protected]> wrote:
> Hi!
>
> After some short thinking I find that using ssh as STONITH is probably the
> wrong thing to do, because it can never STONITH if the target is down already.

Correct.

> (Is it documented that a STONITH operation should work independent of the
> target being up or down?)

STONITH must be reliable. Anything that doesn't work when the target is down is not reliable.

> Maybe some shared storage and a mechanism like sbd is the way to go. With
> everything VMs, shared storage shouldn't be a problem.
>
> So if your VMs fail because the host failed, the cluster will see the loss of
> communication, will try to fence the affected VMs (to be sure they are down),
> and then start to rerun failed resources. Right?

Assuming the VMs are cluster nodes, yes. In which case you could use fence_xvm or whatever the SUSE equivalent is.

> Regards,
> Ulrich
>
>>>> Alex Sudakar <[email protected]> schrieb am 29.08.2013 um 04:11 in
>>>> Nachricht <calq2s-f-m_b92quh0kbb4ozufstfk2gavyq19+xvsb7of8k...@mail.gmail.com>:
>>
>> Hi. I'm running a simple two-node cluster with Red Hat Enterprise
>> Linux 6, corosync 1.4.1, pacemaker 1.1.7 and crm. Nodes A and B are
>> each KVM VMs on separate physical machines/hypervisors.
>>
>> I have two separate networks between the machines; network X is a
>> bonded pair, network Y a direct link between the two hypervisors.
>> I've set up corosync to use X for ring #0 and Y for ring #1. The
>> clustered application is a Tomcat web application running on top of a
>> DRBD block device. I've configured DRBD to use X for its
>> communication link (DRBD apparently can't handle more than one
>> network/link). I've set up the DRBD 'fence-peer' and
>> 'after-resync-target' hooks to call the scripts that manage location
>> constraints when network X is down.
>>
>> I also appreciate the documentation out there that teaches one all of
>> this. :-)
>>
>> Nodes A & B each have two STONITH resources set up for the other.
>> Node A, for example, has resources 'fence_B_X' and 'fence_B_Y', which
>> are instances of the stonith:fence_virsh resource agent configured to
>> use either network X or network Y to ssh to B's physical host and
>> use 'virsh' to power off B.
>>
>> Everything's currently working as expected, but I have a couple of
>> questions about how STONITH works and how I might be able to tweak it.
>>
>> 1. How can I add STONITH for the physical machines into the mix?
>>
>> In one of my tests I powered off B's hypervisor while B was running
>> the clustered application. A tried to stonith B, but both of its fence
>> resources failed, since neither could ssh to that hypervisor to run
>> virsh:
>>
>>   stonith-ng[1785]: notice: can_fence_host_with_device: Disabling
>>   port list queries for fence_B_X (0/-1): Unable to connect/login to
>>   fencing device
>>   stonith-ng[1785]: info: can_fence_host_with_device: fence_B_X
>>   can not fence B (aka. 'B'): dynamic-list
>>
>> That's as expected, of course. As was the cluster's 'freeze' on
>> starting the application on A; I've been told on this list that
>> Pacemaker won't move resources if/while its STONITH operations have
>> failed.
>>
>> How could I add, say, a 'fence_ilo' stonith resource agent to the
>> mix? The idea being that, if a node can't successfully use
>> fence_virsh to power down the other VM, it can then escalate to
>> trying to power off that node's entire hypervisor.
>>
>> Q 1.1. Could I set up a stonith fence_ilo resource with suitable
>> pcmk_* parameters to 'map' the cluster node names A & B to the
>> hypervisor names that the fence_ilo resource would understand and use
>> to take down the hypervisor?
>>
>> Q 1.2. How can I prioritize fence resources?
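[An aside on Q 1.1: the standard stonith-ng device parameters pcmk_host_list, pcmk_host_map and pcmk_host_check exist for exactly this kind of mapping. A rough crm shell sketch, in which the iLO address, credentials and resource names are all made-up assumptions:

```
# Device that fences cluster node A by powering off A's hypervisor via iLO.
# pcmk_host_list tells stonith-ng which cluster node this device can fence;
# the hostname and credentials below are hypothetical.
primitive fence_A_hypervisor stonith:fence_ilo \
    params ipaddr=hyp-a-ilo.example.com login=fenceuser passwd=secret \
        pcmk_host_list="A" pcmk_host_check="static-list"
# Never run the device on the node it is meant to fence.
location l_fence_A_hypervisor fence_A_hypervisor -inf: A
```

For agents that take a "port" argument, such as fence_virsh acting on a libvirt domain, pcmk_host_map="A:vm-a-domain" additionally maps the cluster node name to the port name the agent should act on.]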
>> I haven't come across that sort of thing; please forgive me if it's a
>> basic configuration facility (of crm) that I've forgotten. (It's not
>> the same as priority weights in allocating a resource to a node, but
>> rather the order in which each stonith agent is invoked in a stonith
>> situation on that node.) Can I set up the cluster so fence_A_X and
>> fence_A_Y will be tried first, before the cluster attempts
>> 'fence_A_hypervisor'?
>>
>> Q2. How can I set up STONITH to proceed even if STONITH wasn't achieved?
>>
>> I understand the golden rule of a Pacemaker cluster is not to proceed
>> - to 'freeze' - if STONITH fails, and why that rule is in place: to
>> avoid the dangers of split-brain. Really. :-)
>>
>> But in this case - the improbable event that both networks are down -
>> my preference is for each node (if up) to independently run the
>> application, i.e. for each to assume the 'primary' role on its side
>> of the DRBD resource and run the Tomcat application. I'm okay with
>> that, I want that, I'll accept the burden of reconciling antagonistic
>> data updates after the crisis is over. Really. :-)
>>
>> Is there an accepted/standard way to tell a STONITH agent - or
>> fence_virsh - to return 'success' even if the STONITH operation
>> failed? Or a configuration item that instructs Pacemaker to proceed
>> 'after best effort'?
>>
>> Or do I have to 'hack' my copy of fence_virsh, look for where it
>> returns 'Unable to connect/login to fencing device', and have it
>> 'pretend' it didn't fail (i.e. the agent lies and says that the VM
>> is down)?
>>
>> Will anyone on this list speak to me if I admit to doing something
>> like that? :-)
>>
>> Many thanks for any help!
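[On Q 1.2: Pacemaker's "fencing levels" provide exactly this ordering; in crm shell it is the fencing_topology directive (check that your crmsh/pacemaker build supports it; it appeared around the 1.1.7/1.1.8 era). A sketch using the resource names from the message above, with 'fence_A_hypervisor'/'fence_B_hypervisor' as hypothetical iLO devices:

```
# Each space-separated entry is a fencing level, tried left to right;
# a later level is only attempted after the earlier level has failed.
# (Comma-joined devices within a single level must ALL succeed.)
fencing_topology \
    A: fence_A_X fence_A_Y fence_A_hypervisor \
    B: fence_B_X fence_B_Y fence_B_hypervisor
```
]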
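[On Q2: there is no supported Pacemaker option to treat failed fencing as success, for the split-brain reasons already given. If that risk is accepted anyway, the usual hack is not to patch fence_virsh but to wrap it. A minimal POSIX sh sketch of the exit-status override (the function name and agent path are assumptions; this deliberately defeats fencing's safety guarantee):

```shell
# fence_with_override: run a real fence agent; if it fails, log the
# failure but report success anyway so the cluster proceeds.
# WARNING: this invites split-brain; use only if you truly accept that.
fence_with_override() {
    real_agent="$1"    # e.g. /usr/sbin/fence_virsh (path is an assumption)
    shift
    if "$real_agent" "$@"; then
        return 0       # genuine success
    fi
    # Agent failed (e.g. hypervisor unreachable): log it, then lie.
    logger -t fence_override \
        "fence agent $real_agent failed; reporting success anyway" \
        2>/dev/null || true
    return 0
}
```

Installed as a stonith "agent" a wrapper would also need the usual fence-agent argument/stdin handling; the sketch shows only the override itself.]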
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
