On 2013-10-03 22:12, David Parker wrote:
> Thanks, Andrew. The goal was to use either Pacemaker and Corosync 1.x from
> the Debian packages, or use both compiled from source. So, with the compiled
> version, I was hoping to avoid CMAN. However, it seems the packaged version
> of Pacemaker doesn't support CMAN anyway, so it's moot.
>
> I rebuilt my VMs from scratch, re-installed Pacemaker and Corosync from the
> Debian packages, but I'm still having an odd problem. Here is the config
> portion of my CIB:
>
> <crm_config>
>   <cluster_property_set id="cib-bootstrap-options">
>     <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff"/>
>     <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/>
>     <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
>     <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>     <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
>   </cluster_property_set>
> </crm_config>
>
> I set no-quorum-policy=ignore based on the documentation example for a
> 2-node cluster. But when Pacemaker starts up on the first node, the DRBD
> resource is in slave mode and none of the other resources are started
> (they depend on DRBD being master), and I see these lines in the log:
>
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs_fs (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs_ip (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs (test-vm-1 - blocked)
> Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start drbd_r0:0 (test-vm-1)
>
> I'm assuming the NFS resources show "blocked" because the resource they
> depend on is not in the correct state.
>
> Even when the second node (test-vm-2) comes online, the state of these
> resources does not change. I can shut down and restart Pacemaker over and
> over again on test-vm-2, but nothing changes. However... and this is where
> it gets weird... if I shut down Pacemaker on test-vm-1, then all of the
> resources immediately fail over to test-vm-2 and start correctly. And I see
> these lines in the log:
>
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: stage6: Scheduling Node test-vm-1 for shutdown
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs_fs (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs_ip (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs (test-vm-2)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Stop drbd_r0:0 (test-vm-1)
> Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Promote drbd_r0:1 (Slave -> Master test-vm-2)
>
> After that, I can generally move the resources back and forth, and even fail
> them over by hard-failing a node, without any problems. The real problem is
> that this isn't consistent, though.
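As an aside, the crm_config section above is just cluster options; it corresponds to crm shell commands along these lines (a minimal sketch, assuming the crm shell from the Wheezy pacemaker package is available):

    crm configure property stonith-enabled=false
    crm configure property no-quorum-policy=ignore
    # dc-version, cluster-infrastructure and expected-quorum-votes are
    # maintained by Pacemaker/Corosync themselves and are not normally
    # set by hand.
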
> Every once in a while, I'll hard-fail a node and the other one will go into
> this "stuck" state where Pacemaker knows it lost a node, but DRBD will stay
> in slave mode and the other resources will never start. It seems to happen
> quite randomly. Then, even if I restart Pacemaker on both nodes, or reboot
> them altogether, I run into the startup issue mentioned previously.
>
> Any ideas?
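The symptom described above (a DRBD master/slave resource that never gets promoted, plus dependent resources stuck in "blocked") usually comes down to the colocation and ordering constraints around the master/slave resource, which is why the resource section matters. That section isn't quoted anywhere in this thread; for orientation only, here is a sketch of what such a configuration typically looks like in crm shell syntax, reusing the resource names from the logs. The DRBD device, mount point, filesystem type, IP address, the grouping, and the agent behind the nfs resource are all assumptions:

    primitive drbd_r0 ocf:linbit:drbd \
            params drbd_resource="r0" \
            op monitor interval="29s" role="Master" \
            op monitor interval="31s" role="Slave"
    ms ms_drbd_r0 drbd_r0 \
            meta master-max="1" master-node-max="1" \
                 clone-max="2" clone-node-max="1" notify="true"
    primitive nfs_fs ocf:heartbeat:Filesystem \
            params device="/dev/drbd0" directory="/srv/nfs" fstype="ext4"
    primitive nfs_ip ocf:heartbeat:IPaddr2 \
            params ip="192.168.25.200" cidr_netmask="24"
    primitive nfs lsb:nfs-kernel-server
    group nfs_group nfs_fs nfs_ip nfs
    colocation nfs_on_drbd_master inf: nfs_group ms_drbd_r0:Master
    order nfs_after_drbd_promote inf: ms_drbd_r0:promote nfs_group:start

Note also that the monitor output quoted further down in the thread appears to come from the deprecated ocf:heartbeat:drbd agent; ocf:linbit:drbd is the maintained agent and sets the master preference scores Pacemaker uses when deciding whether to promote an instance.
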
Yes, share your complete resource configuration ;-)

Regards,
Andreas

> Thanks,
> Dave
>
> On Wed, Oct 2, 2013 at 1:01 AM, Andrew Beekhof <and...@beekhof.net <mailto:and...@beekhof.net>> wrote:
>
> > On 02/10/2013, at 5:24 AM, David Parker <dpar...@utica.edu <mailto:dpar...@utica.edu>> wrote:
> >
> > > Thanks, I did a little Googling and found the git repository for pcs.
> >
> > pcs won't help you rebuild pacemaker with cman support (or corosync 2.x support) turned on though.
> >
> > > Is there any way to make a two-node cluster work with the stock Debian packages, though? It seems odd that this would be impossible.
> >
> > It really depends how the Debian maintainers built pacemaker. By the sounds of it, it only supports the pacemaker plugin mode for corosync 1.x.
> >
> > On Tue, Oct 1, 2013 at 3:16 PM, Larry Brigman <larry.brig...@gmail.com <mailto:larry.brig...@gmail.com>> wrote:
> >
> > pcs is another package you will need to install.
> >
> > On Oct 1, 2013 9:04 AM, "David Parker" <dpar...@utica.edu <mailto:dpar...@utica.edu>> wrote:
> >
> > Hello,
> >
> > Sorry for the delay in my reply. I've been doing a lot of experimentation, but so far I've had no luck.
> >
> > Thanks for the suggestion, but it seems I'm not able to use CMAN. I'm running Debian Wheezy with Corosync and Pacemaker installed via apt-get. When I installed CMAN and set up a cluster.conf file, Pacemaker refused to start and said that CMAN was not supported. When CMAN is not installed, Pacemaker starts up fine, but I see these lines in the log:
> >
> > Sep 30 23:36:29 test-vm-1 crmd: [6941]: ERROR: init_quorum_connection: The Corosync quorum API is not supported in this build
> > Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: ERROR: pcmk_child_exit: Child process crmd exited (pid=6941, rc=100)
> > Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: WARN: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
> >
> > So, then I checked to see which plugins are supported:
> >
> > # pacemakerd -F
> > Pacemaker 1.1.7 (Build: ee0730e13d124c3d58f00016c3376a1de5323cff)
> >  Supporting: generated-manpages agent-manpages ncurses heartbeat corosync-plugin snmp libesmtp
> >
> > Am I correct in believing that this Pacemaker package has been compiled without support for any quorum API? If so, does anyone know if there is a Debian package which has the correct support?
> >
> > I also tried compiling LibQB, Corosync and Pacemaker from source via git, following the instructions documented here:
> >
> > http://clusterlabs.org/wiki/SourceInstall
> >
> > I was hopeful that this would work, because as I understand it, Corosync 2.x no longer uses CMAN. Everything compiled and started fine, but the compiled version of Pacemaker did not include either the 'crm' or 'pcs' commands. Do I need to install something else in order to get one of these?
> >
> > Any and all help is greatly appreciated!
> >
> > Thanks,
> > Dave
> >
> > On Wed, Sep 25, 2013 at 6:08 AM, David Lang <da...@lang.hm <mailto:da...@lang.hm>> wrote:
> >
> > The cluster is trying to reach a quorum (the majority of the nodes talking to each other), and that is never going to happen with only one node, so you have to disable this.
> > Try putting
> >
> > <cman two_node="1" expected_votes="1" transport="udpu"/>
> >
> > in your cluster.conf.
> >
> > David Lang
> >
> > On Tue, 24 Sep 2013, David Parker wrote:
> >
> > Date: Tue, 24 Sep 2013 11:48:59 -0400
> > From: David Parker <dpar...@utica.edu <mailto:dpar...@utica.edu>>
> > Reply-To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org <mailto:pacemaker@oss.clusterlabs.org>>
> > To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org <mailto:pacemaker@oss.clusterlabs.org>>
> > Subject: Re: [Pacemaker] Corosync won't recover when a node fails
> >
> > I forgot to mention, OS is Debian Wheezy 64-bit, Corosync and Pacemaker installed from packages via apt-get, and there are no local firewall rules in place:
> >
> > # iptables -L
> > Chain INPUT (policy ACCEPT)
> > target     prot opt source               destination
> >
> > Chain FORWARD (policy ACCEPT)
> > target     prot opt source               destination
> >
> > Chain OUTPUT (policy ACCEPT)
> > target     prot opt source               destination
> >
> > On Tue, Sep 24, 2013 at 11:41 AM, David Parker <dpar...@utica.edu <mailto:dpar...@utica.edu>> wrote:
> >
> > Hello,
> >
> > I have a 2-node cluster using Corosync and Pacemaker, where the nodes are actually two VirtualBox VMs on the same physical machine. I have some resources set up in Pacemaker, and everything works fine if I move them in a controlled way with the "crm_resource -r <resource> --move --node <node>" command.
> >
> > However, when I hard-fail one of the nodes via the "poweroff" command in VirtualBox, which "pulls the plug" on the VM, the resources do not move, and I see the following output in the log on the remaining node:
> >
> > Sep 24 11:20:30 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
> > Sep 24 11:20:30 corosync [TOTEM ] A processor failed, forming new configuration.
> > Sep 24 11:20:30 corosync [TOTEM ] entering GATHER state from 2.
> > Sep 24 11:20:31 test-vm-2 lrmd: [2503]: debug: rsc:drbd_r0:0 monitor[31] (pid 8495)
> > drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is deprecated and may be removed in a future release. See the man page for details. To suppress this warning, set the "ignore_deprecation" resource parameter to true.
> > drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is deprecated and may be removed in a future release. See the man page for details. To suppress this warning, set the "ignore_deprecation" resource parameter to true.
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c /etc/drbd.conf role r0
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Secondary/Primary
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c /etc/drbd.conf cstate r0
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Connected
> > drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0 status: Secondary/Primary Secondary Primary Connected
> > Sep 24 11:20:31 test-vm-2 lrmd: [2503]: info: operation monitor[31] on drbd_r0:0 for client 2506: pid 8495 exited with return code 0
> > Sep 24 11:20:32 corosync [TOTEM ] entering GATHER state from 0.
> > Sep 24 11:20:34 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:34 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:36 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:36 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:38 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:38 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:40 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:40 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:40 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> > Sep 24 11:20:43 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:43 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:43 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> > Sep 24 11:20:45 corosync [TOTEM ] The consensus timeout expired.
> > Sep 24 11:20:45 corosync [TOTEM ] entering GATHER state from 3.
> > Sep 24 11:20:45 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> > Sep 24 11:20:47 corosync [TOTEM ] The consensus timeout expired.
> >
> > Those last 3 messages just repeat over and over, the cluster never recovers, and the resources never move. "crm_mon" reports that the resources are still running on the dead node, and shows no indication that anything has gone wrong.
> >
> > Does anyone know what the issue could be? My expectation was that the remaining node would become the sole member of the cluster, take over the resources, and everything would keep running.
> >
> > For reference, my corosync.conf file is below:
> >
> > compatibility: whitetank
> >
> > totem {
> >         version: 2
> >         secauth: off
> >         interface {
> >                 member {
> >                         memberaddr: 192.168.25.201
> >                 }
> >                 member {
> >                         memberaddr: 192.168.25.202
> >                 }
> >                 ringnumber: 0
> >                 bindnetaddr: 192.168.25.0
> >                 mcastport: 5405
> >         }
> >         transport: udpu
> > }
> >
> > logging {
> >         fileline: off
> >         to_logfile: yes
> >         to_syslog: yes
> >         debug: on
> >         logfile: /var/log/cluster/corosync.log
> >         timestamp: on
> >         logger_subsys {
> >                 subsys: AMF
> >                 debug: on
> >         }
> > }
> >
> > Thanks!
> > Dave
> >
> > --
> > Dave Parker
> > Systems Administrator
> > Utica College
> > Integrated Information Technology Services
> > (315) 792-3229
> > Registered Linux User #408177

--
Need help with Pacemaker?
http://www.hastexo.com/now
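One closing note on the corosync.conf quoted above: it is a corosync 1.x-style configuration with no quorum section at all. For the corosync 2.x built from source earlier in the thread, the two-node case is normally handled by votequorum directly in corosync.conf rather than by CMAN's cluster.conf. A sketch, not a drop-in config (the addresses are taken from the member entries above):

    nodelist {
            node {
                    ring0_addr: 192.168.25.201
            }
            node {
                    ring0_addr: 192.168.25.202
            }
    }

    quorum {
            provider: corosync_votequorum
            expected_votes: 2
            two_node: 1
    }

With two_node set, votequorum keeps the surviving node quorate when its peer dies, and the current membership and quorum state can be checked with corosync-quorumtool -s.
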
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org