Any thoughts on this would be much appreciated :)

On Wed, Apr 8, 2015 at 5:16 PM, Jorge Lopes <jmclo...@gmail.com> wrote:
(I'm a bit confused because I received an auto-reply from pacemaker-boun...@oss.clusterlabs.org saying this list is now inactive, but I just received a digest with my mail. It happens that I had resent the email to the new list with a bit more information, which was missing in the first message. So here is that extra bit, anyway.)

I have also noticed this pattern (with both STONITH resources running):
1. With the cluster running without errors, I run "stop docker" on node cluster-a-1.
2. This causes the vCenter STONITH to act as expected.
3. After the cluster is running without errors again, I run "stop docker" on node cluster-a-1 once more.
4. Now the vCenter STONITH doesn't run; instead, the IPMI STONITH runs. This is unexpected, because I was expecting the vCenter STONITH to run again.

On Wed, Apr 8, 2015 at 4:20 PM, Jorge Lopes <jmclo...@gmail.com> wrote:

Hi all.

I'm having difficulty orchestrating two STONITH devices in my cluster. I have been struggling with this over the past few days and I need some help, please.

A simplified version of my cluster and its goals is as follows:
- The cluster has two physical servers, each running two nodes (VMware virtual machines): overall, there are 4 nodes in this simplified version.
- There are two resource groups: group-cluster-a and group-cluster-b.
- To achieve a good CPU balance across the physical servers, the cluster is asymmetric, with one group running on one server and the other group running on the other server.
- If the VM of one host becomes unusable, its resources are started on its sister VM deployed on the other physical host.
- If one physical host becomes unusable, all resources are started on the other physical host.
- Two STONITH levels are used to fence the problematic nodes (see the fencing_topology sketch below).

The resources have the following behavior:
- If the resource monitor detects a problem, Pacemaker tries to restart the resource on the same node.
- If that fails, STONITH takes place (vCenter reboots the VM) and Pacemaker starts the resource on the sister VM on the other physical host.
- If restarting the VM fails, I want to power off the physical server, and Pacemaker will start all resources on the other physical host.
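By "two STONITH levels" I mean something along the lines of the following fencing_topology sketch. I have not applied this yet - it is only how I understand levels are declared in the crm shell, using the node names and STONITH resource names from the configuration below, with vCenter tried first and IPMI as the second level:

fencing_topology \
    cluster-a-1: stonith-vcenter-host1 stonith-ipmi-host1 \
    cluster-a-2: stonith-vcenter-host1 stonith-ipmi-host1 \
    cluster-b-1: stonith-vcenter-host2 stonith-ipmi-host2 \
    cluster-b-2: stonith-vcenter-host2 stonith-ipmi-host2

The intention would be that, for each node, the cluster tries the vCenter device first and only escalates to the IPMI device if that level fails.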
The HA stack is:
Ubuntu 14.04 (the node OS, which is a virtualized guest running on VMware ESXi 5.5)
Pacemaker 1.1.12
Corosync 2.3.4
CRM 2.1.2

The 4 nodes are:
cluster-a-1
cluster-a-2
cluster-b-1
cluster-b-2

The relevant configuration is:

property symmetric-cluster=false
property stonith-enabled=true
property no-quorum-policy=stop

group group-cluster-a vip-cluster-a docker-web
location loc-group-cluster-a-1 group-cluster-a inf: cluster-a-1
location loc-group-cluster-a-2 group-cluster-a 500: cluster-a-2

group group-cluster-b vip-cluster-b docker-srv
location loc-group-cluster-b-1 group-cluster-b 500: cluster-b-1
location loc-group-cluster-b-2 group-cluster-b inf: cluster-b-2

# stonith vcenter definitions for host 1
# run in any of the host2 VMs
primitive stonith-vcenter-host1 stonith:external/vcenter \
    params \
        VI_SERVER="192.168.40.20" \
        VI_CREDSTORE="/etc/vicredentials.xml" \
        HOSTLIST="cluster-a-1=cluster-a-1;cluster-a-2=cluster-a-2" \
        RESETPOWERON="1" \
        priority="2" \
        pcmk_host_check="static-list" \
        pcmk_host_list="cluster-a-1 cluster-a-2" \
    op monitor interval="60s"

location loc1-stonith-vcenter-host1 stonith-vcenter-host1 500: cluster-b-1
location loc2-stonith-vcenter-host1 stonith-vcenter-host1 501: cluster-b-2

# stonith vcenter definitions for host 2
# run in any of the host1 VMs
primitive stonith-vcenter-host2 stonith:external/vcenter \
    params \
        VI_SERVER="192.168.40.21" \
        VI_CREDSTORE="/etc/vicredentials.xml" \
        HOSTLIST="cluster-b-1=cluster-b-1;cluster-b-2=cluster-b-2" \
        RESETPOWERON="1" \
        priority="2" \
        pcmk_host_check="static-list" \
        pcmk_host_list="cluster-b-1 cluster-b-2" \
    op monitor interval="60s"

location loc1-stonith-vcenter-host2 stonith-vcenter-host2 500: cluster-a-1
location loc2-stonith-vcenter-host2 stonith-vcenter-host2 501: cluster-a-2

# stonith IPMI definitions for host 1 (DELL with iDRAC 7 enterprise interface at 192.168.40.15)
# run in any of the host2 VMs
primitive stonith-ipmi-host1 stonith:external/ipmi \
    params hostname="host1" ipaddr="192.168.40.15" userid="root" passwd="mypassword" interface="lanplus" \
        priority="1" \
        pcmk_host_check="static-list" \
        pcmk_host_list="cluster-a-1 cluster-a-2" \
    op start interval="0" timeout="60s" requires="nothing" \
    op monitor interval="3600s" timeout="20s" requires="nothing"

location loc1-stonith-ipmi-host1 stonith-ipmi-host1 500: cluster-b-1
location loc2-stonith-ipmi-host1 stonith-ipmi-host1 501: cluster-b-2

# stonith IPMI definitions for host 2 (DELL with iDRAC 7 enterprise interface at 192.168.40.16)
# run in any of the host1 VMs
primitive stonith-ipmi-host2 stonith:external/ipmi \
    params hostname="host2" ipaddr="192.168.40.16" userid="root" passwd="mypassword" interface="lanplus" \
        priority="1" \
        pcmk_host_check="static-list" \
        pcmk_host_list="cluster-b-1 cluster-b-2" \
    op start interval="0" timeout="60s" requires="nothing" \
    op monitor interval="3600s" timeout="20s" requires="nothing"

location loc1-stonith-ipmi-host2 stonith-ipmi-host2 500: cluster-a-1
location loc2-stonith-ipmi-host2 stonith-ipmi-host2 501: cluster-a-2

What is working:
- When an error is detected in one resource, the resource restarts on the same node, as expected.
- With the STONITH external/ipmi resource *stopped*, a failure in one node makes vCenter reboot it and the resources start on the sister node.

What is not so good:
- When vCenter reboots one node, the resources start on the other node as expected, but they return to the original node as soon as it comes back online. This causes a bit of ping-pong, and I think it is a consequence of how the locations are defined. Any suggestion to avoid this? After a resource has been moved to another node, I would prefer that it stays there instead of returning to the original node. I can think of playing with the resource affinity scores - is that the way it should be done? (A rough sketch of what I have in mind follows below.)

What is wrong:
Let's consider this scenario. I have a set of resources provided by a docker agent. My test consists of stopping the docker service on node cluster-a-1, which makes the docker agent return OCF_ERR_INSTALLED to Pacemaker (this is a change I made in the docker agent, compared to the GitHub repository version). With the IPMI STONITH resource stopped, this leads to node cluster-a-1 restarting, which is expected.

But with the IPMI STONITH resource started, I notice erratic behavior:
- Sometimes the resources on node cluster-a-1 are stopped and no STONITH happens. Also, the resources are not moved to node cluster-a-2. In this situation, if I manually restart node cluster-a-1 (virtual machine restart), then the IPMI STONITH takes place and restarts the corresponding physical server.
- Sometimes the IPMI STONITH starts before the vCenter STONITH, which is not expected because the vCenter STONITH has higher priority.

I might have something wrong in my STONITH definition, but I can't figure out what. Any idea how to correct this?

And how can I set external/ipmi to power off the physical host, instead of rebooting it? (An idea I am considering is sketched below.)
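Regarding the fail-back / ping-pong question above, what I have in mind is only a rough, untested sketch: give the resources some default stickiness and replace the inf: node preferences with finite scores, so the stickiness can win once a resource has moved. For example (the actual score values here are just guesses on my part):

rsc_defaults resource-stickiness="1000"

location loc-group-cluster-a-1 group-cluster-a 600: cluster-a-1
location loc-group-cluster-a-2 group-cluster-a 500: cluster-a-2
location loc-group-cluster-b-1 group-cluster-b 500: cluster-b-1
location loc-group-cluster-b-2 group-cluster-b 600: cluster-b-2

With a stickiness of 1000 and a node preference of only 600, my expectation is that a resource that has failed over stays where it is when the original node comes back online.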
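As for powering off the physical host instead of rebooting it, the only idea I have so far (untested, based on my reading of the pcmk_* fencing parameters) is to map the reboot request to an off action on the IPMI device, for example:

primitive stonith-ipmi-host1 stonith:external/ipmi \
    params hostname="host1" ipaddr="192.168.40.15" userid="root" passwd="mypassword" interface="lanplus" \
        priority="1" \
        pcmk_host_check="static-list" \
        pcmk_host_list="cluster-a-1 cluster-a-2" \
        pcmk_reboot_action="off" \
    op start interval="0" timeout="60s" requires="nothing" \
    op monitor interval="3600s" timeout="20s" requires="nothing"

If I understand pcmk_reboot_action correctly, the cluster would then ask the iDRAC for "off" whenever it fences through this device, and the server would stay powered down until switched on manually.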
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org