OK, thanks for the note, Steven.  I've filed the bug; it is #525589.

Steven Dake wrote:
Remi,

Likely a defect.  We will have to look into it.  Please file a bug as
per instructions on the corosync wiki at www.corosync.org.

On Thu, 2009-09-24 at 16:47 -0600, Remi Broemeling wrote:
  
I've spent all day working on this, even going so far as to completely
build my own set of packages from the Debian-available ones (which
appear to be different from the Ubuntu-available ones).  It had no
effect on the issue at all: the cluster still freaks out and ends up
split-brain after a single SIGQUIT.

The Debian packages that also demonstrate this behavior were the
following versions:
    cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
    corosync_1.0.0-5~bpo50+1_i386.deb
    libcorosync4_1.0.0-5~bpo50+1_i386.deb
    libopenais3_1.0.0-4~bpo50+1_i386.deb
    openais_1.0.0-4~bpo50+1_i386.deb
    pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb

These packages were rebuilt (under Ubuntu Hardy Heron LTS) from the
*.diff.gz, *.dsc, and *.orig.tar.gz files available at
http://people.debian.org/~madkiss/ha-corosync, and as I said, the
symptoms remain exactly the same, both under the configuration that I
list below and under the sample configuration that came with these
packages.  I also tried the same thing with a single IP address
resource associated with the cluster, just to be sure it wasn't an
edge case for a cluster with no resources, but again that had no
effect.
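
For reference, the rebuild itself was nothing exotic; per package it
was roughly the following (file names are approximate, so treat this
as a sketch of what I did rather than an exact transcript):

    # fetch the .dsc for one package, e.g. corosync, plus the matching
    # .diff.gz and .orig.tar.gz from the same directory
    wget http://people.debian.org/~madkiss/ha-corosync/corosync_1.0.0-5~bpo50+1.dsc

    # unpack the source and build binary packages on the Hardy host
    dpkg-source -x corosync_1.0.0-5~bpo50+1.dsc
    cd corosync-1.0.0
    dpkg-buildpackage -rfakeroot -b -us -uc

    # install the resulting .debs
    sudo dpkg -i ../corosync_*.deb ../libcorosync4_*.deb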

Basically, I'm still exactly where I was yesterday morning at about
09:00.

Remi Broemeling wrote: 
    
I posted this to the OpenAIS mailing list
(open...@lists.linux-foundation.org) yesterday, but haven't received
a response, and on further reflection I think I may have chosen the
wrong list to post to.  That list seems to be far less about user
support and far more about developer communication.  So I'm retrying
here, as the archives show this list to be somewhat more
user-focused.

The problem is that corosync refuses to shut down in response to a
QUIT signal.  Given the cluster below (output of crm_mon):

============
Last updated: Wed Sep 23 15:56:24 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ boot1 boot2 ]

If I go onto the host 'boot2' and issue the command "killall -QUIT
corosync", the anticipated result is that boot2 goes offline (out of
the cluster) and all of the cluster processes
(corosync/stonithd/cib/lrmd/attrd/pengine/crmd) shut down.  However,
this is not what happens, and I don't really have any idea why.
After logging into boot2 and issuing "killall -QUIT corosync", the
result is a split-brain:

From boot1's viewpoint:
============
Last updated: Wed Sep 23 15:58:27 2009
Stack: openais
Current DC: boot1 - partition WITHOUT quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ boot1 ]
OFFLINE: [ boot2 ]

From boot2's viewpoint:
============
Last updated: Wed Sep 23 15:58:35 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ boot1 boot2 ]

At this point the status quo holds until ANOTHER QUIT signal is sent
to corosync (i.e. "killall -QUIT corosync" is executed on boot2
again).  Then boot2 shuts down properly and everything appears to be
kosher.  Basically, what I expect to happen after a single QUIT
signal instead takes two QUIT signals to occur, which summarizes my
question: why does it take two QUIT signals to force corosync to
actually shut down?  Is that the intended behavior?  From everything
I have read online it seems very strange, and it makes me think I
have a problem in my configuration(s), but I've no idea what that
would be, even after playing with things and investigating for the
day.
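
For clarity, the exact sequence I'm running is nothing more than the
following (crm_mon -1 is just my way of getting a one-shot status
dump; the interactive crm_mon shows the same thing):

    # on boot2
    killall -QUIT corosync

    # then, on each of boot1 and boot2, check membership
    crm_mon -1

    # a second signal on boot2 is what finally makes it exit
    killall -QUIT corosync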

I would be very grateful for any guidance that could be provided, as
at the moment I seem to be at an impasse.

Log files, with debugging set to 'on', can be found at the following
pastebin locations:
    After first QUIT signal issued on boot2:
        boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
        boot2:/var/log/syslog: http://pastebin.com/d26fdfee
    After second QUIT signal issued on boot2:
        boot1:/var/log/syslog: http://pastebin.com/m755fb989
        boot2:/var/log/syslog: http://pastebin.com/m22dcef45

OS, Software Packages, and Versions:
    * two nodes, each running Ubuntu Hardy Heron LTS
    * ubuntu-ha packages, as downloaded from
      http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
        * pacemaker-openais package version 1.0.5+hg20090813-0ubuntu2~hardy1
        * openais package version 1.0.0-3ubuntu1~hardy1
        * corosync package version 1.0.0-4ubuntu1~hardy2
        * heartbeat-common package version 2.99.2+sles11r9-5ubuntu1~hardy1

Network Setup:
    * boot1
        * eth0 is 192.168.10.192
        * eth1 is 172.16.1.1
    * boot2
        * eth0 is 192.168.10.193
        * eth1 is 172.16.1.2
    * boot1:eth0 and boot2:eth0 both connect to the same switch.
    * boot1:eth1 and boot2:eth1 are connected directly to each other
via a cross-over cable.
    * no firewalls are involved, and tcpdump shows the multicast and
UDP traffic flowing correctly over these links (see the sketch after
this list).
    * I attempted a broadcast (rather than multicast) configuration,
to see if that would fix the problem.  It did not.
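
The tcpdump checks mentioned above were along these lines (interfaces
and ports as per the corosync.conf further down; the filters are
approximate):

    # ring 0 traffic on the cross-over link
    tcpdump -ni eth1 udp port 5505

    # ring 1 traffic on the switched link
    tcpdump -ni eth0 udp port 6606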

`crm configure show` output:
    node boot1
    node boot2
    property $id="cib-bootstrap-options" \
            dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
            cluster-infrastructure="openais" \
            expected-quorum-votes="2" \
            stonith-enabled="false" \
            no-quorum-policy="ignore"
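
For the single-IP-resource test mentioned earlier, the resource I
added was along the lines of the following (the name and address here
are just placeholders):

    primitive test-ip ocf:heartbeat:IPaddr2 \
            params ip="192.168.10.200" cidr_netmask="24" \
            op monitor interval="10s"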

Contents of /etc/corosync/corosync.conf:
    # Please read the corosync.conf.5 manual page
    compatibility: whitetank

    totem {
        clear_node_high_bit: yes
        version: 2
        secauth: on
        threads: 1
        heartbeat_failures_allowed: 3
        interface {
                ringnumber: 0
                bindnetaddr: 172.16.1.0
                mcastaddr: 239.42.0.1
                mcastport: 5505
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.10.0
                mcastaddr: 239.42.0.2
                mcastport: 6606
        }
        rrp_mode: passive
    }

    amf {
        mode: disabled
    }

    service {
        name: pacemaker
        ver: 0
    }

    aisexec {
        user: root
        group: root
    }

    logging {
        debug: on
        fileline: off
        function_name: off
        to_logfile: no
        to_stderr: no
        to_syslog: yes
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
    }
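
In case it's relevant, the per-ring status can also be dumped on each
node with corosync-cfgtool; I'm happy to include that output as well
if it would help:

    # one-shot ring status for the local node
    corosync-cfgtool -s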
      

--

Remi Broemeling
Sr System Administrator

Nexopia.com Inc.
direct: 780 444 1250 ext 435
email: r...@nexopia.com
fax: 780 487 0376

www.nexopia.com

ICMP: The protocol that goes PING!
