[Linux-HA] DRBD/Pacemaker/OpenAIS problems

S. A. Woltering Wed, 04 Nov 2009 02:40:28 -0800

Hello,
I hope I'm not posting anything new to the list, but I probably am.


I'm in the process of building up a two-node cluster based on DRBD,
Pacemaker and OpenAIS.

I have an 800GB and a 200GB RAID partition on each of two HP DL360s (the
800GB allocated to resource "mail", the 200GB to resource "rsync"). Both
run CentOS 5.4 on x86_64, with Pacemaker set up as detailed in Andrew
Beekhof's "Cluster from scratch - Apache" document.

My DRBD config is as follows:
# /etc/drbd.conf
common {
    protocol               C;
    net {
        allow-two-primaries;
        cram-hmac-alg    sha1;
        shared-secret    XXXXXX;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-only;
    }
    syncer {
        rate             100M;
        verify-alg       sha1;
    }
    startup {
        wfc-timeout       20;
        degr-wfc-timeout  10;
    }
    handlers {
        fence-peer       /usr/lib/drbd/crm-fence-peer.sh;
        after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
    }
}
resource mail {
    on gemini {
        device           /dev/drbd0 minor 0;
        disk             /dev/cciss/c0d0p3;
        address          ipv4 XX.hb1.addy.xx:7789;
        meta-disk        internal;
    }
    on soyuz {
        device           /dev/drbd0 minor 0;
        disk             /dev/cciss/c0d0p3;
        address          ipv4 XX.hb1.addy.xx:7789;
        meta-disk        internal;
    }
}
resource rsync {
    on gemini {
        device           /dev/drbd1 minor 1;
        disk             /dev/cciss/c0d0p4;
        address          ipv4 XX.hb2.addy.xx:7789;
        meta-disk        internal;
    }
    on soyuz {
        device           /dev/drbd1 minor 1;
        disk             /dev/cciss/c0d0p4;
        address          ipv4 XX.hb2.addy.xx:7789;
        meta-disk        internal;
    }
}

Output from "crm configure show" is:

node gemini \
        attributes standby="off"
node soyuz \
        attributes standby="off"
primitive drbd-mail ocf:linbit:drbd \
        params drbd_resource="mail" \
        op monitor interval="15s"
primitive fs-mail ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/data/mail" fstype="ext3"
primitive ip-mail ocf:heartbeat:IPaddr2 \
        params ip="xxx.xxx.xxx.xxx" nic="bond0"
primitive st-gemini stonith:external/riloe \
        params hostlist="gemini" ilo_hostname="xxx.xxx.xxx.xxx" ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0" ilo_protocol="2.0" ilo_powerdown_method="button -S" \
        op monitor interval="60s"
primitive st-soyuz stonith:external/riloe \
        params hostlist="soyuz" ilo_hostname="xxx.xxx.xxx.xxx" ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0" ilo_protocol="2.0" ilo_powerdown_method="button -S" \
        op monitor interval="60s"
group mailservice fs-mail ip-mail \
        meta target-role="Started"
ms ms-drbd-mail drbd-mail \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
##########I didn't set this... should I delete it?
location drbd-fence-by-handler-ms-drbd-mail ms-drbd-mail \
        rule $id="drbd-fence-by-handler-rule-ms-drbd-mail" $role="Master" -inf: #uname ne soyuz
##########
location l-st-gemini st-gemini -inf: gemini
location l-st-soyuz st-soyuz -inf: soyuz
colocation mail-on-drbd inf: mailservice ms-drbd-mail:Master
order mail-after-drbd inf: ms-drbd-mail:promote mailservice:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="true" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1257329728"
rsc_defaults $id="rsc-options" \
        resource-stickiness="200"

DRBD on its own works fine. However, when using the OCF agent as shown
in the output above, I see some strange effects.

When I performed a "destructive" test on one node (i.e. yanked the power
out), everything failed over smoothly, but when I brought the downed node
back online, it refused to reconnect to the DRBD "mail" resource.
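
The DRBD side of "refused to reconnect" can be checked directly on the rejoined node. A minimal sketch, assuming the resource name "mail" from the config above and a DRBD 8.3-style drbdadm; the drbd_state_cmds helper and the EXECUTE variable are hypothetical names used only for illustration, and by default the script just prints the commands:

```shell
#!/bin/sh
# Sketch: list the DRBD state queries for one resource.
# Prints the commands by default; set EXECUTE=1 to run them on a cluster node.
RES="${RES:-mail}"   # resource name, assumed from drbd.conf above

drbd_state_cmds() {
    # cstate = connection state, dstate = disk state, role = Primary/Secondary
    for query in cstate dstate role; do
        echo "drbdadm $query $1"
    done
}

if [ "${EXECUTE:-0}" = "1" ]; then
    drbd_state_cmds "$RES" | sh
else
    drbd_state_cmds "$RES"
fi
```

Comparing the output on both nodes shows whether the peer is waiting for a connection, standing alone, or missing its backing disk.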

I get the following from "crm_mon -1":
============
Last updated: Wed Nov  4 10:05:48 2009
Stack: openais
Current DC: soyuz - partition with quorum
Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ soyuz gemini ]

st-gemini       (stonith:external/riloe):       Started soyuz
st-soyuz        (stonith:external/riloe):       Started gemini

Failed actions:
    drbd-mail:1_start_0 (node=gemini, call=8, rc=-2, status=Timed Out): unknown

So I manually re-attached and reconnected the resource on the "quiet"
node, and I saw this:

# crm resource show
st-gemini       (stonith:external/riloe) Started
st-soyuz        (stonith:external/riloe) Stopped
Master/Slave Set: ms-drbd-mail
        Masters: [ soyuz ]
        Stopped: [ drbd-mail:1 ]
Resource Group: mailservice
    fs-mail     (ocf::heartbeat:Filesystem) Stopped
    ip-mail     (ocf::heartbeat:IPaddr2) Stopped
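
For reference, the manual re-attach and reconnect mentioned above would look roughly like the following. This is a sketch, assuming the resource name "mail" and that it is run as root on the node that failed to rejoin; the reattach_cmds helper and the EXECUTE variable are hypothetical, and by default nothing is executed:

```shell
#!/bin/sh
# Sketch: bring a DRBD resource back by hand on the node that did not rejoin.
# Prints the commands by default; set EXECUTE=1 to run them.
RES="${RES:-mail}"   # resource name, assumed from drbd.conf above

reattach_cmds() {
    echo "drbdadm attach $1"    # re-attach the backing disk
    echo "drbdadm connect $1"   # reconnect to the peer
    echo "cat /proc/drbd"       # show the resulting connection and disk state
}

if [ "${EXECUTE:-0}" = "1" ]; then
    reattach_cmds "$RES" | sh
else
    reattach_cmds "$RES"
fi
```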

After about half an hour (and after I clear the failure messages), it
sorts itself out and starts working correctly again.
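
Clearing the failure messages can be done from the crm shell. A minimal sketch, assuming the master/slave resource "ms-drbd-mail" and the node "gemini" from the output above are the ones to clean up; the cleanup_cmds helper and the EXECUTE variable are hypothetical, and by default the script only prints the commands:

```shell
#!/bin/sh
# Sketch: clear the failed start recorded for the DRBD resource, then re-check status.
# RSC and NODE are assumptions taken from the crm_mon output above.
RSC="${RSC:-ms-drbd-mail}"
NODE="${NODE:-gemini}"

cleanup_cmds() {
    echo "crm resource cleanup $1 $2"   # forget failed actions for resource $1 on node $2
    echo "crm_mon -1"                   # one-shot cluster status
}

if [ "${EXECUTE:-0}" = "1" ]; then
    cleanup_cmds "$RSC" "$NODE" | sh
else
    cleanup_cmds "$RSC" "$NODE"
fi
```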

Can anyone offer some advice as to why it might be doing this, please?
Thanks,
Ashley
- --
Ashley Woltering, Systems Analyst,
