Re: [ClusterLabs] group resources not grouped ?!?
On 10/08/2015 12:57 PM, Jorge Fábregas wrote:
> On 10/08/2015 06:04 AM, zulucloud wrote:
>> are there any other ways?
>
> Hi,
>
> You might want to check external/vmware or external/vcenter. I've never
> used them, but apparently one is used to fence via the hypervisor (ESXi
> itself) and the other through vCenter.

Hello,

thank you very much, this looks interesting... maybe another option.
I'll give it a closer look.

brgds

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] group resources not grouped ?!?
On 10/08/2015 06:04 AM, zulucloud wrote:
> are there any other ways?

Hi,

You might want to check external/vmware or external/vcenter. I've never
used them, but apparently one is used to fence via the hypervisor (ESXi
itself) and the other through vCenter.

--
Jorge
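[Editor's note: as a rough illustration of the vCenter route Jorge mentions, a stonith resource in crm shell syntax might look like the sketch below. This is an assumption, not a tested configuration: the parameter names (VI_SERVER, VI_CREDSTORE, HOSTLIST, RESETPOWERON), the credential-store path, and the node-to-VM name mapping all need to be checked against the external/vcenter plugin's own help output before use.]

```
# Hypothetical sketch only -- verify parameter names with:
#   stonith -t external/vcenter -h
primitive st-vcenter stonith:external/vcenter \
    params VI_SERVER="vcenter.example.com" \
        VI_CREDSTORE="/etc/pacemaker/vicredentials.xml" \
        HOSTLIST="ali=ali-vm;baba=baba-vm" \
        RESETPOWERON="0" \
    op monitor interval="60s"
```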
Re: [ClusterLabs] group resources not grouped ?!?
On 10/07/2015 07:09 PM, Ken Gaillot wrote:
> So, to proceed:
> 1) Stonith would help :)

Hi Ken, all,

my two nodes are virtual machines hosted on two vmware servers, which are
hosting other virtual machines as well. A shared storage could be made
available to my nodes, although without a cluster filesystem.

I've read http://clusterlabs.org/doc/crm_fencing.html and parts of
http://www.linux-ha.org/wiki/SBD_Fencing . My conclusion is that there
are just two possibilities to achieve stonith/fencing that make sense in
that particular situation: "external/sbd" or "external/ssh". Both do
node-level, not resource-level, fencing. As "external/ssh" is considered
to be for testing purposes only, my way to go would be "external/sbd".

Is that correct, or are there any other ways?

thx :)
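[Editor's note: for the SBD route discussed above, the usual outline is sketched below, assuming a small shared LUN visible to both nodes; the device path is a placeholder, and note that SBD also relies on a hardware or software watchdog being available on each node. See the SBD_Fencing wiki page linked above for the authoritative steps.]

```
# 1) Initialize the shared disk for SBD (run once, from one node).
#    The device path is a placeholder for your shared LUN.
sbd -d /dev/disk/by-id/MY-SHARED-LUN create

# 2) Point the sbd daemon at the device, e.g. in /etc/sysconfig/sbd:
#    SBD_DEVICE="/dev/disk/by-id/MY-SHARED-LUN"
#    (then restart the cluster stack on both nodes)

# 3) Configure the stonith resource (crm shell syntax, matching the
#    config style used elsewhere in this thread):
crm configure primitive stonith-sbd stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/MY-SHARED-LUN"

# 4) Re-enable stonith, which is currently off in this cluster:
crm configure property stonith-enabled="true"
```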
Re: [ClusterLabs] group resources not grouped ?!?
On 10/07/2015 09:12 AM, zulucloud wrote:
> Hi,
> i got a problem i don't understand, maybe someone can give me a hint.
>
> My 2-node cluster (named ali and baba) is configured to run mysql, an IP
> for mysql and the filesystem resource (on drbd master) together as a
> GROUP. After doing some crash-tests i ended up having filesystem and
> mysql running happily on one host (ali), and the related IP on the other
> (baba), although the IP's not really up and running; crm_mon just
> SHOWS it as started there. In fact it's nowhere up, neither on ali nor
> on baba.
>
> crm_mon shows that pacemaker tried to start it on baba, but gave up
> after fail-count=100.
>
> Q1: why doesn't pacemaker put the IP on ali, where all the rest of its
> group lives?
> Q2: why doesn't pacemaker try to start the IP on ali, after max
> failcount had been reached on baba?
> Q3: why is crm_mon showing the IP as "started", when it's down after
> 10 tries?
>
> Thanks :)
>
> config (some parts removed):
> ---
> node ali
> node baba
>
> primitive res_drbd ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op stop interval="0" timeout="100" \
>     op start interval="0" timeout="240" \
>     op promote interval="0" timeout="90" \
>     op demote interval="0" timeout="90" \
>     op notify interval="0" timeout="90" \
>     op monitor interval="40" role="Slave" timeout="20" \
>     op monitor interval="20" role="Master" timeout="20"
> primitive res_fs ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/drbd_mnt" fstype="ext4" \
>     op monitor interval="30s"
> primitive res_hamysql_ip ocf:heartbeat:IPaddr2 \
>     params ip="XXX.XXX.XXX.224" nic="eth0" cidr_netmask="23" \
>     op monitor interval="10s" timeout="20s" depth="0"
> primitive res_mysql lsb:mysql \
>     op start interval="0" timeout="15" \
>     op stop interval="0" timeout="15" \
>     op monitor start-delay="30" interval="15" time-out="15"
>
> group gr_mysqlgroup res_fs res_mysql res_hamysql_ip \
>     meta target-role="Started"
> ms ms_drbd res_drbd \
>     meta master-max="1" master-node-max="1" clone-max="2" \
>     clone-node-max="1" notify="true"
>
> colocation col_fs_on_drbd_master inf: res_fs:Started ms_drbd:Master
>
> order ord_drbd_master_then_fs inf: ms_drbd:promote res_fs:start
>
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
>     cluster-infrastructure="openais" \
>     stonith-enabled="false" \

Not having stonith is part of the problem (see below). Without stonith,
if the two nodes go into split brain (both up but unable to communicate
with each other), Pacemaker will try to promote DRBD to master on both
nodes, mount the filesystem on both nodes, and start MySQL on both nodes.

>     no-quorum-policy="ignore" \
>     expected-quorum-votes="2" \
>     last-lrm-refresh="1438857246"
>
> crm_mon -rnf (some parts removed):
> -
> Node ali: online
>     res_fs (ocf::heartbeat:Filesystem) Started
>     res_mysql (lsb:mysql) Started
>     res_drbd:0 (ocf::linbit:drbd) Master
> Node baba: online
>     res_hamysql_ip (ocf::heartbeat:IPaddr2) Started
>     res_drbd:1 (ocf::linbit:drbd) Slave
>
> Inactive resources:
>
> Migration summary:
> * Node baba:
>     res_hamysql_ip: migration-threshold=100 fail-count=100
>
> Failed actions:
>     res_hamysql_ip_stop_0 (node=a891vl107s, call=35, rc=1,
>     status=complete): unknown error

The "_stop_" above means that a *stop* action on the IP failed.
Pacemaker tried to migrate the IP by first stopping it on baba, but it
couldn't. (Since the IP is the last member of the group, its failure
didn't prevent the other members from moving.)

Normally, when a stop fails, Pacemaker fences the node so it can safely
bring up the resource on the other node. But you disabled stonith, so it
got into this state.

So, to proceed:

1) Stonith would help :)

2) Figure out why it couldn't stop the IP. There might be a clue in the
logs on baba (though they are indeed hard to follow; search for
"res_hamysql_ip_stop_0" around this time, and look around there).

You could also try adding and removing the IP manually, first with the
usual OS commands, and if that works, by calling the IP resource agent
directly. That often turns up the problem.

> corosync.log:
> --
> pengine: [1223]: WARN: should_dump_input: Ignoring requirement that
> res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0:
> unmanaged failed resources cannot prevent shutdown
>
> Software:
> --
> corosync 1.2.1-4
> pacemaker 1.0.9.1+hg15626-1
> drbd8-utils 2:8.3.7-2.1
> (for some reason it's not possible to update at this time)

It should be possible to get
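[Editor's note: the manual test Ken suggests can be sketched roughly as below, using the address, netmask, and interface from the config quoted in this thread (the XXX placeholder stands for the redacted address). The agent path /usr/lib/ocf may differ by distribution; run as root on baba.]

```
# 1) Try the plain OS commands first:
ip addr add XXX.XXX.XXX.224/23 dev eth0
ip addr show dev eth0        # the address should be listed here
ip addr del XXX.XXX.XXX.224/23 dev eth0

# 2) If that works, invoke the IPaddr2 resource agent directly, with
#    the same parameters Pacemaker passes via the environment:
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_ip="XXX.XXX.XXX.224"
export OCF_RESKEY_nic="eth0"
export OCF_RESKEY_cidr_netmask="23"
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 start;   echo "start rc=$?"
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor; echo "monitor rc=$?"
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 stop;    echo "stop rc=$?"
```

A non-zero return code from "stop" here would reproduce the failed stop action that Pacemaker reported, with any error messages printed directly to the terminal.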
[ClusterLabs] group resources not grouped ?!?
Hi,

i got a problem i don't understand, maybe someone can give me a hint.

My 2-node cluster (named ali and baba) is configured to run mysql, an IP
for mysql and the filesystem resource (on drbd master) together as a
GROUP. After doing some crash-tests i ended up having filesystem and
mysql running happily on one host (ali), and the related IP on the other
(baba), although the IP's not really up and running; crm_mon just SHOWS
it as started there. In fact it's nowhere up, neither on ali nor on baba.

crm_mon shows that pacemaker tried to start it on baba, but gave up
after fail-count=100.

Q1: why doesn't pacemaker put the IP on ali, where all the rest of its
group lives?
Q2: why doesn't pacemaker try to start the IP on ali, after max
failcount had been reached on baba?
Q3: why is crm_mon showing the IP as "started", when it's down after
10 tries?

Thanks :)

config (some parts removed):
---
node ali
node baba

primitive res_drbd ocf:linbit:drbd \
    params drbd_resource="r0" \
    op stop interval="0" timeout="100" \
    op start interval="0" timeout="240" \
    op promote interval="0" timeout="90" \
    op demote interval="0" timeout="90" \
    op notify interval="0" timeout="90" \
    op monitor interval="40" role="Slave" timeout="20" \
    op monitor interval="20" role="Master" timeout="20"
primitive res_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/drbd_mnt" fstype="ext4" \
    op monitor interval="30s"
primitive res_hamysql_ip ocf:heartbeat:IPaddr2 \
    params ip="XXX.XXX.XXX.224" nic="eth0" cidr_netmask="23" \
    op monitor interval="10s" timeout="20s" depth="0"
primitive res_mysql lsb:mysql \
    op start interval="0" timeout="15" \
    op stop interval="0" timeout="15" \
    op monitor start-delay="30" interval="15" time-out="15"

group gr_mysqlgroup res_fs res_mysql res_hamysql_ip \
    meta target-role="Started"
ms ms_drbd res_drbd \
    meta master-max="1" master-node-max="1" clone-max="2" \
    clone-node-max="1" notify="true"

colocation col_fs_on_drbd_master inf: res_fs:Started ms_drbd:Master

order ord_drbd_master_then_fs inf: ms_drbd:promote res_fs:start

property $id="cib-bootstrap-options" \
    dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
    cluster-infrastructure="openais" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    expected-quorum-votes="2" \
    last-lrm-refresh="1438857246"

crm_mon -rnf (some parts removed):
-
Node ali: online
    res_fs (ocf::heartbeat:Filesystem) Started
    res_mysql (lsb:mysql) Started
    res_drbd:0 (ocf::linbit:drbd) Master
Node baba: online
    res_hamysql_ip (ocf::heartbeat:IPaddr2) Started
    res_drbd:1 (ocf::linbit:drbd) Slave

Inactive resources:

Migration summary:
* Node baba:
    res_hamysql_ip: migration-threshold=100 fail-count=100

Failed actions:
    res_hamysql_ip_stop_0 (node=a891vl107s, call=35, rc=1,
    status=complete): unknown error

corosync.log:
--
pengine: [1223]: WARN: should_dump_input: Ignoring requirement that
res_hamysql_ip_stop_0 comeplete before gr_mysqlgroup_stopped_0:
unmanaged failed resources cannot prevent shutdown

Software:
--
corosync 1.2.1-4
pacemaker 1.0.9.1+hg15626-1
drbd8-utils 2:8.3.7-2.1
(for some reason it's not possible to update at this time)