Re: [Linux-HA] DRBD/Pacemaker/OpenAIS problems

Dejan Muhamedagic Wed, 04 Nov 2009 04:33:42 -0800

Hi,

On Wed, Nov 04, 2009 at 10:40:15AM +0000, S. A. Woltering wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hello,
> I hope I'm not posting anything new to the list, but I probably am.
> 
> I'm in the process of building up a two-node cluster based on DRBD,
> Pacemaker and OpenAIS.
> 
> I've 800GB+200GB RAID partitions of each of two HP DL360s (the 800GB
> allocated to resource "mail", the 200GB to resource "rsync"). Running
> CentOS 5.4, with Pacemaker set up as detailed in Andrew Beekhof's
> "Cluster from scratch - Apache" document. x86_64 architecture.
> 
> My DRBD config is as follows:
> # /etc/drbd.conf
> common {
>     protocol               C;
>     net {
>         allow-two-primaries;
>         cram-hmac-alg    sha1;
>         shared-secret    XXXXXX;
>         after-sb-0pri    discard-zero-changes;
>         after-sb-1pri    discard-secondary;
>         after-sb-2pri    disconnect;
>     }
>     disk {
>         fencing          resource-only;
>     }
>     syncer {
>         rate             100M;
>         verify-alg       sha1;
>     }
>     startup {
>         wfc-timeout       20;
>         degr-wfc-timeout  10;
>     }
>     handlers {
>         fence-peer       /usr/lib/drbd/crm-fence-peer.sh;
>         after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
>     }
> }
> resource mail {
>     on gemini {
>         device           /dev/drbd0 minor 0;
>         disk             /dev/cciss/c0d0p3;
>         address          ipv4 XX.hb1.addy.xx:7789;
>         meta-disk        internal;
>     }
>     on soyuz {
>         device           /dev/drbd0 minor 0;
>         disk             /dev/cciss/c0d0p3;
>         address          ipv4 XX.hb1.addy.xx:7789;
>         meta-disk        internal;
>     }
> }
> resource rsync {
>     on gemini {
>         device           /dev/drbd1 minor 1;
>         disk             /dev/cciss/c0d0p4;
>         address          ipv4 XX.hb2.addy.xx:7789;
>         meta-disk        internal;
>     }
>     on soyuz {
>         device           /dev/drbd1 minor 1;
>         disk             /dev/cciss/c0d0p4;
>         address          ipv4 XX.hb2.addy.xx:7789;
>         meta-disk        internal;
>     }
> }
> 
> Output from "crm configure show" is:
> 
> node gemini \
>         attributes standby="off"
> node soyuz \
>         attributes standby="off"
> primitive drbd-mail ocf:linbit:drbd \
>         params drbd_resource="mail" \
>         op monitor interval="15s"
> primitive fs-mail ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/data/mail" fstype="ext3"
> primitive ip-mail ocf:heartbeat:IPaddr2 \
>         params ip="xxx.xxx.xxx.xxx" nic="bond0"
> primitive st-gemini stonith:external/riloe \
>         params hostlist="gemini" ilo_hostname="xxx.xxx.xxx.xxx" ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0" ilo_protocol="2.0" ilo_powerdown_method="button -S" \

"button -S" is not supported; only "button" and "power" are. The
latter is, I think, more reliable, the former easier on the
hardware. And riloe should issue a warning if it doesn't
recognize the method.
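
A minimal sketch of one way to switch both stonith resources to a
supported method, assuming your crm shell has the "resource param"
subcommand (please check "crm resource help" on your version). The
"run" wrapper is just a hypothetical dry-run helper of mine: it only
prints the commands unless you set RUN=1.

```shell
#!/bin/sh
# Dry-run helper: prints each command; set RUN=1 to really execute it.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

# Set ilo_powerdown_method on both riloe stonith resources from the
# quoted configuration. "power" and "button" are the two supported values.
set_powerdown_method() {
    method="$1"
    for rsc in st-gemini st-soyuz; do
        run crm resource param "$rsc" set ilo_powerdown_method "$method"
    done
}

set_powerdown_method power
```

Editing the two primitives by hand with "crm configure edit" should
give the same result.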

>         op monitor interval="60s"
> primitive st-soyuz stonith:external/riloe \
>         params hostlist="soyuz" ilo_hostname="xxx.xxx.xxx.xxx" ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0" ilo_protocol="2.0" ilo_powerdown_method="button -S" \
>         op monitor interval="60s"
> group mailservice fs-mail ip-mail \
>         meta target-role="Started"
> ms ms-drbd-mail drbd-mail \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
> ##########I didn't set this... should I delete it?
> location drbd-fence-by-handler-ms-drbd-mail ms-drbd-mail \
>         rule $id="drbd-fence-by-handler-rule-ms-drbd-mail" $role="Master" -inf: #uname ne soyuz
> ##########
> location l-st-gemini st-gemini -inf: gemini
> location l-st-soyuz st-soyuz -inf: soyuz
> colocation mail-on-drbd inf: mailservice ms-drbd-mail:Master
> order mail-after-drbd inf: ms-drbd-mail:promote mailservice:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="true" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1257329728"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="200"
> 
> DRBD on its own works fine. However, when using the OCF agent as shown
> in the output above, I see some strange effects.
> 
> When I performed a "destructive" test on one node (i.e. yanked the power
> out), everything failed over smoothly, but when I brought the downed node
> back online, it refused to reconnect to the DRBD "mail" resource.
> 
> I get the following from "crm_mon -1":
> ============
> Last updated: Wed Nov  4 10:05:48 2009
> Stack: openais
> Current DC: soyuz - partition with quorum
> Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
> 
> Online: [ soyuz gemini ]
> 
> st-gemini       (stonith:external/riloe):       Started soyuz
> st-soyuz        (stonith:external/riloe):       Started gemini
> 
> Failed actions:
>     drbd-mail:1_start_0 (node=gemini, call=8, rc=-2, status=Timed Out): unknown
> 
> So I manually re-attached and reconnected the resource on the "quiet"
> node, and I see this:
> 
> # crm resource show
> st-gemini       (stonith:external/riloe) Started
> st-soyuz        (stonith:external/riloe) Stopped
> Master/Slave Set: ms-drbd-mail
>         Masters: [ soyuz ]
>         Stopped: [ drbd-mail:1 ]
> Resource Group: mailservice
>     fs-mail     (ocf::heartbeat:Filesystem) Stopped
>     ip-mail     (ocf::heartbeat:IPaddr2) Stopped
> 
> After about half an hour (and clearing up failure messages) it sorts
> itself out and starts working again correctly.

You have to clean up resources after manual intervention.
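
Roughly like this, as a sketch only. It assumes that cleaning up the
master/slave resource by its ms name covers the failed drbd-mail:1
instance; depending on the version you may have to name the instance
and the node instead (e.g. "drbd-mail:1 gemini"). "run" is again a
hypothetical dry-run helper that only prints unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper: prints each command; set RUN=1 to really execute it.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

# Clear the failed start recorded for the DRBD master/slave resource,
# then print a one-shot status so the result can be checked.
cleanup_drbd_mail() {
    run crm resource cleanup ms-drbd-mail
    run crm_mon -1
}

cleanup_drbd_mail
```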

> Can anyone offer some advice as to why it might be doing this, please?

These are two issues: a) why the drbd resource can't start after
split-brain, and b) why it takes such a long time for the
cluster to recover after resource recovery and cleanup.

Issue a) is, I guess, a configuration issue, i.e. it depends on
whether you chose automatic or manual split-brain recovery. Just
guessing, not an expert on drbd, but there should be very good
documentation at linbit's site.
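
For what it's worth, the after-sb-* lines in your drbd.conf are the
automatic recovery policies, and after-sb-2pri is set to "disconnect".
If the kernel log really does report "Split-Brain detected" and the
nodes stay disconnected, the manual procedure is, as far as I know,
along the lines of the sketch below. Which node is the "victim" is
your decision and an assumption here (I picked gemini only because it
was the one power-cycled). Note that --discard-my-data throws away
that node's changes, which is why the hypothetical "run" helper only
prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper: prints each command; set RUN=1 to really execute it.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

# Manual split-brain recovery for one DRBD resource.
#   $1 = resource name
#   $2 = "victim" (its changes are thrown away) or "survivor"
recover_split_brain() {
    res="$1"; role="$2"
    case "$role" in
        victim)
            run drbdadm secondary "$res"
            run drbdadm -- --discard-my-data connect "$res"
            ;;
        survivor)
            # Only needed if the survivor has dropped to StandAlone.
            run drbdadm connect "$res"
            ;;
        *)
            echo "usage: recover_split_brain RESOURCE victim|survivor" >&2
            return 1
            ;;
    esac
}

recover_split_brain mail victim    # assumed to be run on gemini
```

Check /proc/drbd on both nodes before and after, and clean up the
resource in the cluster afterwards.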

For b) there is not enough information, i.e. a hb_report would be
required. Or was it that you ran resource cleanup only later?
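
Something like the following sketch would collect it. The from/to
times and the destination are placeholders I made up around the
timestamp in your crm_mon output; adjust them to cover the power pull
and the recovery. The "run" helper is hypothetical and only prints
the command unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper: prints each command; set RUN=1 to really execute it.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

# Collect logs and cluster state for a given time window.
#   $1 = from time, $2 = to time, $3 = destination (placeholder values below)
collect_report() {
    from="$1"; to="$2"; dest="$3"
    run hb_report -f "$from" -t "$to" "$dest"
}

collect_report "2009/11/04 09:30" "2009/11/04 10:45" /tmp/drbd-mail-report
```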

Thanks,

Dejan

> Thanks,
> Ashley
> - --
> Ashley Woltering, Systems Analyst,
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.4-svn0 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iD8DBQFK8VoPh854NVK99FMRAlHBAJ9vum7mZYteZeXjai6fIt4JhHvrOACdHJ2i
> ReEW/RmM9YOnV9y2UN1ncAA=
> =z1nn
> -----END PGP SIGNATURE-----
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
