The cluster is trying to reach quorum (a majority of the nodes talking to
each other), and with only two nodes a majority is two, so a lone surviving
node can never reach it. You have to disable the quorum requirement.
Try putting

    <cman two_node="1" expected_votes="1" transport="udpu"/>

in your cluster.conf.
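That line is for cman-based stacks. If you're running Pacemaker directly on
top of corosync without cman (which your corosync.conf suggests), the rough
equivalent, as I understand it, is to tell Pacemaker to keep running
resources when quorum is lost:

    # two-node clusters only: keep running resources without quorum
    crm configure property no-quorum-policy=ignore

Bear in mind that ignoring quorum without working fencing (STONITH) leaves
you open to split-brain, so treat this as a sketch for a test setup rather
than a production recipe.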
David Lang
On Tue, 24 Sep 2013, David Parker wrote:
Date: Tue, 24 Sep 2013 11:48:59 -0400
From: David Parker <dpar...@utica.edu>
Reply-To: The Pacemaker cluster resource manager
<pacemaker@oss.clusterlabs.org>
To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
Subject: Re: [Pacemaker] Corosync won't recover when a node fails
I forgot to mention: the OS is Debian Wheezy 64-bit, Corosync and Pacemaker
were installed from packages via apt-get, and there are no local firewall
rules in place:
# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
On Tue, Sep 24, 2013 at 11:41 AM, David Parker <dpar...@utica.edu> wrote:
Hello,
I have a 2-node cluster using Corosync and Pacemaker, where the nodes are
actually two VirtualBox VMs on the same physical machine. I have some
resources set up in Pacemaker, and everything works fine if I move them in
a controlled way with the "crm_resource -r <resource> --move --node <node>"
command.
However, when I hard-fail one of the nodes via the "poweroff" command in
VirtualBox, which "pulls the plug" on the VM, the resources do not move,
and I see the following output in the log on the remaining node:
Sep 24 11:20:30 corosync [TOTEM ] The token was lost in the OPERATIONAL
state.
Sep 24 11:20:30 corosync [TOTEM ] A processor failed, forming new
configuration.
Sep 24 11:20:30 corosync [TOTEM ] entering GATHER state from 2.
Sep 24 11:20:31 test-vm-2 lrmd: [2503]: debug: rsc:drbd_r0:0 monitor[31]
(pid 8495)
drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is
deprecated and may be removed in a future release. See the man page for
details. To suppress this warning, set the "ignore_deprecation" resource
parameter to true.
drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is
deprecated and may be removed in a future release. See the man page for
details. To suppress this warning, set the "ignore_deprecation" resource
parameter to true.
drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c
/etc/drbd.conf role r0
drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output:
Secondary/Primary
drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c
/etc/drbd.conf cstate r0
drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0
drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Connected
drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0 status: Secondary/Primary
Secondary Primary Connected
Sep 24 11:20:31 test-vm-2 lrmd: [2503]: info: operation monitor[31] on
drbd_r0:0 for client 2506: pid 8495 exited with return code 0
Sep 24 11:20:32 corosync [TOTEM ] entering GATHER state from 0.
Sep 24 11:20:34 corosync [TOTEM ] The consensus timeout expired.
Sep 24 11:20:34 corosync [TOTEM ] entering GATHER state from 3.
Sep 24 11:20:36 corosync [TOTEM ] The consensus timeout expired.
Sep 24 11:20:36 corosync [TOTEM ] entering GATHER state from 3.
Sep 24 11:20:38 corosync [TOTEM ] The consensus timeout expired.
Sep 24 11:20:38 corosync [TOTEM ] entering GATHER state from 3.
Sep 24 11:20:40 corosync [TOTEM ] The consensus timeout expired.
Sep 24 11:20:40 corosync [TOTEM ] entering GATHER state from 3.
Sep 24 11:20:40 corosync [TOTEM ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
Sep 24 11:20:43 corosync [TOTEM ] The consensus timeout expired.
Sep 24 11:20:43 corosync [TOTEM ] entering GATHER state from 3.
Sep 24 11:20:43 corosync [TOTEM ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
Sep 24 11:20:45 corosync [TOTEM ] The consensus timeout expired.
Sep 24 11:20:45 corosync [TOTEM ] entering GATHER state from 3.
Sep 24 11:20:45 corosync [TOTEM ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
Sep 24 11:20:47 corosync [TOTEM ] The consensus timeout expired.
Those last three messages just repeat over and over, the cluster never
recovers, and the resources never move. "crm_mon" reports that the
resources are still running on the dead node and shows no indication that
anything has gone wrong.
Does anyone know what the issue could be? My expectation was that the
remaining node would become the sole member of the cluster, take over the
resources, and everything would keep running.
For reference, my corosync.conf file is below:
compatibility: whitetank

totem {
        version: 2
        secauth: off
        interface {
                member {
                        memberaddr: 192.168.25.201
                }
                member {
                        memberaddr: 192.168.25.202
                }
                ringnumber: 0
                bindnetaddr: 192.168.25.0
                mcastport: 5405
        }
        transport: udpu
}

logging {
        fileline: off
        to_logfile: yes
        to_syslog: yes
        debug: on
        logfile: /var/log/cluster/corosync.log
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: on
        }
}
Thanks!
Dave
--
Dave Parker
Systems Administrator
Utica College
Integrated Information Technology Services
(315) 792-3229
Registered Linux User #408177
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org