Thanks, I did a little Googling and found the git repository for pcs. Is there any way to make a two-node cluster work with the stock Debian packages, though? It seems odd that this would be impossible.
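From what I've read so far, the stock Wheezy build (the corosync-plugin one) simply has no quorum API compiled in, and people instead tell Pacemaker itself to ignore quorum loss on two-node clusters. A minimal sketch of what I'm planning to try, assuming the 'crm' shell that ships with the stock pacemaker package:

    # tell Pacemaker not to stop all resources when quorum is lost,
    # which is expected in a two-node cluster when one node dies
    crm configure property no-quorum-policy=ignore
    # without working fencing a two-node cluster risks split-brain;
    # disabling STONITH is for this VirtualBox lab only
    crm configure property stonith-enabled=false
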
On Tue, Oct 1, 2013 at 3:16 PM, Larry Brigman <larry.brig...@gmail.com> wrote:

> pcs is another package you will need to install.
>
> On Oct 1, 2013 9:04 AM, "David Parker" <dpar...@utica.edu> wrote:
>
>> Hello,
>>
>> Sorry for the delay in my reply. I've been doing a lot of
>> experimentation, but so far I've had no luck.
>>
>> Thanks for the suggestion, but it seems I'm not able to use CMAN. I'm
>> running Debian Wheezy with Corosync and Pacemaker installed via
>> apt-get. When I installed CMAN and set up a cluster.conf file,
>> Pacemaker refused to start and said that CMAN was not supported. When
>> CMAN is not installed, Pacemaker starts up fine, but I see these lines
>> in the log:
>>
>> Sep 30 23:36:29 test-vm-1 crmd: [6941]: ERROR: init_quorum_connection: The Corosync quorum API is not supported in this build
>> Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: ERROR: pcmk_child_exit: Child process crmd exited (pid=6941, rc=100)
>> Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: WARN: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
>>
>> So, then I checked to see which plugins are supported:
>>
>> # pacemakerd -F
>> Pacemaker 1.1.7 (Build: ee0730e13d124c3d58f00016c3376a1de5323cff)
>>  Supporting:  generated-manpages agent-manpages ncurses heartbeat
>> corosync-plugin snmp libesmtp
>>
>> Am I correct in believing that this Pacemaker package has been
>> compiled without support for any quorum API? If so, does anyone know
>> if there is a Debian package which has the correct support?
>>
>> I also tried compiling libqb, Corosync, and Pacemaker from source via
>> git, following the instructions documented here:
>>
>> http://clusterlabs.org/wiki/SourceInstall
>>
>> I was hopeful that this would work because, as I understand it,
>> Corosync 2.x no longer uses CMAN. Everything compiled and started
>> fine, but the compiled version of Pacemaker did not include either the
>> 'crm' or the 'pcs' command. Do I need to install something else in
>> order to get one of these?
>>
>> Any and all help is greatly appreciated!
>>
>> Thanks,
>> Dave
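This is where I had gotten stuck, by the way: as far as I can tell, neither shell is part of Pacemaker's source tree any more. 'crm' comes from the separate crmsh project, and 'pcs' lives in its own repository, which would explain why my from-source build produced neither. A rough sketch of what I'm trying for pcs, where the repository URL and the Makefile target are just my assumptions from looking around:

    # untested sketch; URL and install target are assumptions on my part
    git clone https://github.com/ClusterLabs/pcs.git
    cd pcs
    make install    # should put the Python-based 'pcs' CLI on the PATH
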
>> On Wed, Sep 25, 2013 at 6:08 AM, David Lang <da...@lang.hm> wrote:
>>
>>> the cluster is trying to reach a quorum (a majority of the nodes
>>> talking to each other), and that is never going to happen with only
>>> one node, so you have to disable this.
>>>
>>> try putting
>>>
>>>   <cman two_node="1" expected_votes="1" transport="udpu"/>
>>>
>>> in your cluster.conf
>>>
>>> David Lang
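For the archives: that two_node hint goes in CMAN's /etc/cluster/cluster.conf. A minimal sketch of the whole file as I understand it (the cluster name is a placeholder of mine; the node names are from my test VMs):

    <?xml version="1.0"?>
    <cluster config_version="1" name="testcluster">
      <!-- two_node/expected_votes let a 2-node cluster keep quorum
           when one node is down -->
      <cman two_node="1" expected_votes="1" transport="udpu"/>
      <clusternodes>
        <clusternode name="test-vm-1" nodeid="1"/>
        <clusternode name="test-vm-2" nodeid="2"/>
      </clusternodes>
    </cluster>

In my case Pacemaker refused to start at all when CMAN was installed, though, so this only helps on a build with CMAN support.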
>>>
>>> On Tue, 24 Sep 2013, David Parker wrote:
>>>
>>>> Date: Tue, 24 Sep 2013 11:48:59 -0400
>>>> From: David Parker <dpar...@utica.edu>
>>>> Reply-To: The Pacemaker cluster resource manager
>>>>     <pacemaker@oss.clusterlabs.org>
>>>> To: The Pacemaker cluster resource manager
>>>>     <pacemaker@oss.clusterlabs.org>
>>>> Subject: Re: [Pacemaker] Corosync won't recover when a node fails
>>>>
>>>> I forgot to mention: the OS is Debian Wheezy 64-bit, Corosync and
>>>> Pacemaker were installed from packages via apt-get, and there are no
>>>> local firewall rules in place:
>>>>
>>>> # iptables -L
>>>> Chain INPUT (policy ACCEPT)
>>>> target     prot opt source               destination
>>>>
>>>> Chain FORWARD (policy ACCEPT)
>>>> target     prot opt source               destination
>>>>
>>>> Chain OUTPUT (policy ACCEPT)
>>>> target     prot opt source               destination
>>>>
>>>>
>>>> On Tue, Sep 24, 2013 at 11:41 AM, David Parker <dpar...@utica.edu> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a 2-node cluster using Corosync and Pacemaker, where the
>>>>> nodes are actually two VirtualBox VMs on the same physical machine.
>>>>> I have some resources set up in Pacemaker, and everything works
>>>>> fine if I move them in a controlled way with the "crm_resource -r
>>>>> <resource> --move --node <node>" command.
>>>>>
>>>>> However, when I hard-fail one of the nodes via the "poweroff"
>>>>> command in VirtualBox, which "pulls the plug" on the VM, the
>>>>> resources do not move, and I see the following output in the log on
>>>>> the remaining node:
>>>>>
>>>>> Sep 24 11:20:30 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
>>>>> Sep 24 11:20:30 corosync [TOTEM ] A processor failed, forming new configuration.
>>>>> Sep 24 11:20:30 corosync [TOTEM ] entering GATHER state from 2.
>>>>> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: debug: rsc:drbd_r0:0 monitor[31] (pid 8495)
>>>>> drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is deprecated and may be removed in a future release. See the man page for details. To suppress this warning, set the "ignore_deprecation" resource parameter to true.
>>>>> drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is deprecated and may be removed in a future release. See the man page for details. To suppress this warning, set the "ignore_deprecation" resource parameter to true.
>>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c /etc/drbd.conf role r0
>>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
>>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Secondary/Primary
>>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c /etc/drbd.conf cstate r0
>>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
>>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Connected
>>>>> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0 status: Secondary/Primary Secondary Primary Connected
>>>>> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: info: operation monitor[31] on drbd_r0:0 for client 2506: pid 8495 exited with return code 0
>>>>> Sep 24 11:20:32 corosync [TOTEM ] entering GATHER state from 0.
>>>>> Sep 24 11:20:34 corosync [TOTEM ] The consensus timeout expired.
>>>>> Sep 24 11:20:34 corosync [TOTEM ] entering GATHER state from 3.
>>>>> Sep 24 11:20:36 corosync [TOTEM ] The consensus timeout expired.
>>>>> Sep 24 11:20:36 corosync [TOTEM ] entering GATHER state from 3.
>>>>> Sep 24 11:20:38 corosync [TOTEM ] The consensus timeout expired.
>>>>> Sep 24 11:20:38 corosync [TOTEM ] entering GATHER state from 3.
>>>>> Sep 24 11:20:40 corosync [TOTEM ] The consensus timeout expired.
>>>>> Sep 24 11:20:40 corosync [TOTEM ] entering GATHER state from 3.
>>>>> Sep 24 11:20:40 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
>>>>> Sep 24 11:20:43 corosync [TOTEM ] The consensus timeout expired.
>>>>> Sep 24 11:20:43 corosync [TOTEM ] entering GATHER state from 3.
>>>>> Sep 24 11:20:43 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
>>>>> Sep 24 11:20:45 corosync [TOTEM ] The consensus timeout expired.
>>>>> Sep 24 11:20:45 corosync [TOTEM ] entering GATHER state from 3.
>>>>> Sep 24 11:20:45 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
>>>>> Sep 24 11:20:47 corosync [TOTEM ] The consensus timeout expired.
>>>>>
>>>>> Those last 3 messages just repeat over and over, the cluster never
>>>>> recovers, and the resources never move. "crm_mon" reports that the
>>>>> resources are still running on the dead node and shows no
>>>>> indication that anything has gone wrong.
>>>>>
>>>>> Does anyone know what the issue could be? My expectation was that
>>>>> the remaining node would become the sole member of the cluster,
>>>>> take over the resources, and everything would keep running.
>>>>>
>>>>> For reference, my corosync.conf file is below:
>>>>>
>>>>> compatibility: whitetank
>>>>>
>>>>> totem {
>>>>>     version: 2
>>>>>     secauth: off
>>>>>     interface {
>>>>>         member {
>>>>>             memberaddr: 192.168.25.201
>>>>>         }
>>>>>         member {
>>>>>             memberaddr: 192.168.25.202
>>>>>         }
>>>>>         ringnumber: 0
>>>>>         bindnetaddr: 192.168.25.0
>>>>>         mcastport: 5405
>>>>>     }
>>>>>     transport: udpu
>>>>> }
>>>>>
>>>>> logging {
>>>>>     fileline: off
>>>>>     to_logfile: yes
>>>>>     to_syslog: yes
>>>>>     debug: on
>>>>>     logfile: /var/log/cluster/corosync.log
>>>>>     timestamp: on
>>>>>     logger_subsys {
>>>>>         subsys: AMF
>>>>>         debug: on
>>>>>     }
>>>>> }
>>>>>
>>>>> Thanks!
>>>>> Dave
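Two things worth checking while that "unable to form a cluster" loop is running, assuming I'm reading the corosync 1.x tools right (the grep pattern is just a guess at the object names):

    corosync-cfgtool -s              # ring status as seen by this node
    corosync-objctl | grep member    # runtime view of the udpu member list
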
--
Dave Parker
Systems Administrator
Utica College
Integrated Information Technology Services
(315) 792-3229
Registered Linux User #408177
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org