Hi,
On Wed, Nov 04, 2009 at 10:40:15AM +0000, S. A. Woltering wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello,
> I hope I'm not posting anything new to the list, but I probably am.
>
> I'm in the process of building up a two-node cluster based on DRBD,
> Pacemaker and OpenAIS.
>
> I've 800GB+200GB RAID partitions of each of two HP DL360s (the 800GB
> allocated to resource "mail", the 200GB to resource "rsync"). Running
> CentOS 5.4, with Pacemaker set up as detailed in Andrew Beekhof's
> "Cluster from scratch - Apache" document. x86_64 architecture.
>
> My DRBD config is as follows:
> # /etc/drbd.conf
> common {
> protocol C;
> net {
> allow-two-primaries;
> cram-hmac-alg sha1;
> shared-secret XXXXXX;
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri disconnect;
> }
> disk {
> fencing resource-only;
> }
> syncer {
> rate 100M;
> verify-alg sha1;
> }
> startup {
> wfc-timeout 20;
> degr-wfc-timeout 10;
> }
> handlers {
> fence-peer /usr/lib/drbd/crm-fence-peer.sh;
> after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
> }
> }
> resource mail {
> on gemini {
> device /dev/drbd0 minor 0;
> disk /dev/cciss/c0d0p3;
> address ipv4 XX.hb1.addy.xx:7789;
> meta-disk internal;
> }
> on soyuz {
> device /dev/drbd0 minor 0;
> disk /dev/cciss/c0d0p3;
> address ipv4 XX.hb1.addy.xx:7789;
> meta-disk internal;
> }
> }
> resource rsync {
> on gemini {
> device /dev/drbd1 minor 1;
> disk /dev/cciss/c0d0p4;
> address ipv4 XX.hb2.addy.xx:7789;
> meta-disk internal;
> }
> on soyuz {
> device /dev/drbd1 minor 1;
> disk /dev/cciss/c0d0p4;
> address ipv4 XX.hb2.addy.xx:7789;
> meta-disk internal;
> }
> }
>
> Output from "crm configure show" is:
>
> node gemini \
> attributes standby="off"
> node soyuz \
> attributes standby="off"
> primitive drbd-mail ocf:linbit:drbd \
> params drbd_resource="mail" \
> op monitor interval="15s"
> primitive fs-mail ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/data/mail" fstype="ext3"
> primitive ip-mail ocf:heartbeat:IPaddr2 \
> params ip="xxx.xxx.xxx.xxx" nic="bond0"
> primitive st-gemini stonith:external/riloe \
> params hostlist="gemini" ilo_hostname="xxx.xxx.xxx.xxx"
> ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0"
> ilo_protocol="2.0" ilo_powerdown_method="button -S" \
"button -S" is not supported. Only "button" and "power". The
latter is, I think, more reliable; the former is easier on the
hardware. riloe should also issue a warning if it doesn't
recognize the method.
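As a rough illustration of the kind of check I mean (a sketch only:
check_powerdown_method is a made-up helper, not something riloe
ships), the value could be validated before it goes into the CIB:

```shell
# Hypothetical helper, not part of riloe: accept only the two
# ilo_powerdown_method values named above, warn on anything else.
check_powerdown_method() {
    case "$1" in
        button|power)
            return 0 ;;
        *)
            echo "WARNING: unsupported ilo_powerdown_method: '$1'" >&2
            return 1 ;;
    esac
}

check_powerdown_method "power" && echo "power: ok"
check_powerdown_method "button -S" || echo "button -S: rejected"
```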
> op monitor interval="60s"
> primitive st-soyuz stonith:external/riloe \
> params hostlist="soyuz" ilo_hostname="xxx.xxx.xxx.xxx"
> ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0"
> ilo_protocol="2.0" ilo_powerdown_method="button -S" \
> op monitor interval="60s"
> group mailservice fs-mail ip-mail \
> meta target-role="Started"
> ms ms-drbd-mail drbd-mail \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true" target-role="Started"
> ##########I didn't set this... should I delete it?
> location drbd-fence-by-handler-ms-drbd-mail ms-drbd-mail \
> rule $id="drbd-fence-by-handler-rule-ms-drbd-mail"
> $role="Master" -inf: #uname ne soyuz
> ##########
> location l-st-gemini st-gemini -inf: gemini
> location l-st-soyuz st-soyuz -inf: soyuz
> colocation mail-on-drbd inf: mailservice ms-drbd-mail:Master
> order mail-after-drbd inf: ms-drbd-mail:promote mailservice:start
> property $id="cib-bootstrap-options" \
> dc-version="1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="true" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1257329728"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="200"
>
> DRBD, on it's own works fine. However, when using the OCF agent as shown
> in the output above, I see some strange effects.
>
> If I perform a "destructive" test on one node (ie: yank the power out),
> everything failed over smoothly, but when I brought the downed node back
> online, it refused to reconnect to the DRBD "mail" resource.
>
> I get the following from "crm_mon -1":
> ============
> Last updated: Wed Nov 4 10:05:48 2009
> Stack: openais
> Current DC: soyuz - partition with quorum
> Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ soyuz gemini ]
>
> st-gemini (stonith:external/riloe): Started soyuz
> st-soyuz (stonith:external/riloe): Started gemini
>
> Failed actions:
> drbd-mail:1_start_0 (node=gemini, call=8, rc=-2, status=Timed Out):
> unknown
>
> So, I manually, re-attach and re-connect to the resource on the "quiet"
> node and I see this:
>
> # crm resource show
> st-gemini (stonith:external/riloe) Started
> st-soyuz (stonith:external/riloe) Stopped
> Master/Slave Set: ms-drbd-mail
> Masters: [ soyuz ]
> Stopped: [ drbd-mail:1 ]
> Resource Group: mailservice
> fs-mail (ocf::heartbeat:Filesystem) Stopped
> ip-mail (ocf::heartbeat:IPaddr2) Stopped
>
> After about half an hour (and clearing up failure messages) it sorts
> itself out and starts working again correctly.
You have to clean up resources after manual intervention.
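Something along these lines, as a sketch. The resource and node
names are taken from your output (I'm assuming the cleanup should
target ms-drbd-mail on gemini), and the RUN variable is just a
convention of this example: by default the commands are printed
rather than executed.

```shell
# Dry run by default: with RUN=echo the commands are only printed.
# Set RUN= (empty) to actually run them on a cluster node.
RUN=${RUN-echo}

cleanup_after_manual_fix() {
    rsc=$1
    node=$2
    # Clear the failed operation from the resource's history on that node.
    $RUN crm resource cleanup "$rsc" "$node"
    # One-shot status, to see whether the failed action is gone.
    $RUN crm_mon -1
}

cleanup_after_manual_fix ms-drbd-mail gemini
```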
> Can anyone offer some advice as to why it might be doing this, please?
There are two issues here: a) why the drbd resource can't start
after split-brain, and b) why it takes so long for the cluster
to recover after resource recovery and cleanup.
a) is, I guess, a configuration issue, i.e. it depends on
whether you chose automatic or manual split-brain recovery. I'm
just guessing, as I'm not an expert on drbd, but there should be
very good documentation on LINBIT's site.
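From what I remember of the DRBD 8.x documentation, manual
recovery goes roughly as below. Please verify this against the
docs for your version. I'm assuming here that it really was a
split-brain, and which node becomes the "victim" whose changes
get thrown away is your decision. The helper only prints the
steps, it runs nothing.

```shell
# Prints (does not run) the manual split-brain recovery steps for one
# DRBD resource. Changes made on the victim since the split are lost.
print_sb_recovery() {
    res=$1
    cat <<EOF
# on the split-brain victim (its changes are discarded):
drbdadm secondary $res
drbdadm -- --discard-my-data connect $res
# on the survivor, only if it is StandAlone:
drbdadm connect $res
EOF
}

print_sb_recovery mail
```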
For b) there is not enough information, i.e. an hb_report would
be required. Or did you run the resource cleanup only later?
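A sketch of how to build the report command. The time window and
the destination below are placeholders, so pick a window that
covers your test. The helper only prints the command, which you
would then run as root on one of the nodes.

```shell
# Prints the hb_report invocation for a given time window and
# destination; nothing is executed here.
print_hb_report_cmd() {
    from=$1
    to=$2
    dest=$3
    printf 'hb_report -f "%s" -t "%s" %s\n' "$from" "$to" "$dest"
}

# Placeholder window around the failover test.
print_hb_report_cmd "2009/11/04 09:30" "2009/11/04 10:30" /tmp/hb_report-mail
```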
Thanks,
Dejan
> Thanks,
> Ashley
> - --
> Ashley Woltering, Systems Analyst,
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.4-svn0 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iD8DBQFK8VoPh854NVK99FMRAlHBAJ9vum7mZYteZeXjai6fIt4JhHvrOACdHJ2i
> ReEW/RmM9YOnV9y2UN1ncAA=
> =z1nn
> -----END PGP SIGNATURE-----
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems