-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hello,
I hope I'm not posting anything new to the list, but I probably am.
I'm in the process of building up a two-node cluster based on DRBD,
Pacemaker and OpenAIS.
I have an 800GB and a 200GB RAID partition on each of two HP DL360s (the
800GB allocated to resource "mail", the 200GB to resource "rsync"). Both
run CentOS 5.4 on x86_64, with Pacemaker set up as detailed in Andrew
Beekhof's "Cluster from scratch - Apache" document.
My DRBD config is as follows:
# /etc/drbd.conf
common {
  protocol C;
  net {
    allow-two-primaries;
    cram-hmac-alg sha1;
    shared-secret XXXXXX;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  disk {
    fencing resource-only;
  }
  syncer {
    rate 100M;
    verify-alg sha1;
  }
  startup {
    wfc-timeout 20;
    degr-wfc-timeout 10;
  }
  handlers {
    fence-peer /usr/lib/drbd/crm-fence-peer.sh;
    after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
  }
}
resource mail {
  on gemini {
    device /dev/drbd0 minor 0;
    disk /dev/cciss/c0d0p3;
    address ipv4 XX.hb1.addy.xx:7789;
    meta-disk internal;
  }
  on soyuz {
    device /dev/drbd0 minor 0;
    disk /dev/cciss/c0d0p3;
    address ipv4 XX.hb1.addy.xx:7789;
    meta-disk internal;
  }
}
resource rsync {
  on gemini {
    device /dev/drbd1 minor 1;
    disk /dev/cciss/c0d0p4;
    address ipv4 XX.hb2.addy.xx:7789;
    meta-disk internal;
  }
  on soyuz {
    device /dev/drbd1 minor 1;
    disk /dev/cciss/c0d0p4;
    address ipv4 XX.hb2.addy.xx:7789;
    meta-disk internal;
  }
}
Output from "crm configure show" is:
node gemini \
    attributes standby="off"
node soyuz \
    attributes standby="off"
primitive drbd-mail ocf:linbit:drbd \
    params drbd_resource="mail" \
    op monitor interval="15s"
primitive fs-mail ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/data/mail" fstype="ext3"
primitive ip-mail ocf:heartbeat:IPaddr2 \
    params ip="xxx.xxx.xxx.xxx" nic="bond0"
primitive st-gemini stonith:external/riloe \
    params hostlist="gemini" ilo_hostname="xxx.xxx.xxx.xxx" \
        ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0" \
        ilo_protocol="2.0" ilo_powerdown_method="button -S" \
    op monitor interval="60s"
primitive st-soyuz stonith:external/riloe \
    params hostlist="soyuz" ilo_hostname="xxx.xxx.xxx.xxx" \
        ilo_user="root" ilo_password="XXXXXX" ilo_can_reset="0" \
        ilo_protocol="2.0" ilo_powerdown_method="button -S" \
    op monitor interval="60s"
group mailservice fs-mail ip-mail \
    meta target-role="Started"
ms ms-drbd-mail drbd-mail \
    meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true" target-role="Started"
##########I didn't set this... should I delete it?
location drbd-fence-by-handler-ms-drbd-mail ms-drbd-mail \
    rule $id="drbd-fence-by-handler-rule-ms-drbd-mail" \
        $role="Master" -inf: #uname ne soyuz
##########
location l-st-gemini st-gemini -inf: gemini
location l-st-soyuz st-soyuz -inf: soyuz
colocation mail-on-drbd inf: mailservice ms-drbd-mail:Master
order mail-after-drbd inf: ms-drbd-mail:promote mailservice:start
property $id="cib-bootstrap-options" \
    dc-version="1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="true" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1257329728"
rsc_defaults $id="rsc-options" \
    resource-stickiness="200"
DRBD on its own works fine. However, when using the OCF agent as shown
in the output above, I see some strange effects.
When I performed a "destructive" test on one node (i.e. yanked the power
out), everything failed over smoothly, but when I brought the downed node
back online, it refused to reconnect to the DRBD "mail" resource.
I get the following from "crm_mon -1":
============
Last updated: Wed Nov 4 10:05:48 2009
Stack: openais
Current DC: soyuz - partition with quorum
Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ soyuz gemini ]
st-gemini (stonith:external/riloe): Started soyuz
st-soyuz (stonith:external/riloe): Started gemini
Failed actions:
drbd-mail:1_start_0 (node=gemini, call=8, rc=-2, status=Timed Out): unknown
So I manually re-attach and reconnect the resource on the "quiet" node,
and I see this:
# crm resource show
st-gemini (stonith:external/riloe) Started
st-soyuz (stonith:external/riloe) Stopped
Master/Slave Set: ms-drbd-mail
Masters: [ soyuz ]
Stopped: [ drbd-mail:1 ]
Resource Group: mailservice
fs-mail (ocf::heartbeat:Filesystem) Stopped
ip-mail (ocf::heartbeat:IPaddr2) Stopped
After about half an hour (and after clearing the failure messages), it
sorts itself out and starts working correctly again.
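For clarity, the manual re-attach, reconnect and cleanup described above
correspond roughly to the commands below. This is a sketch rather than a
transcript: the resource names "mail" and "drbd-mail" and the node name
"gemini" come from the configuration in this post, while the DRY_RUN
switch and the run/recover_mail helpers are hypothetical names added here
so that the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the manual recovery steps for the "mail" DRBD resource.
# DRY_RUN is a hypothetical safety switch: by default the commands are
# only printed, and they are executed only when DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command, then execute it only if dry-run mode is off.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

recover_mail() {
    run drbdadm attach mail                     # re-attach the backing device
    run drbdadm connect mail                    # reconnect to the peer
    run cat /proc/drbd                          # inspect connection/disk state
    run crm resource cleanup drbd-mail gemini   # clear the failed start action
}

recover_mail
```

Run as-is, it prints the four commands prefixed with "+"; on the
recovering node, DRY_RUN=0 would actually execute them.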
Can anyone offer some advice as to why it might be doing this, please?
Thanks,
Ashley
- --
Ashley Woltering, Systems Analyst,
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4-svn0 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iD8DBQFK8VoPh854NVK99FMRAlHBAJ9vum7mZYteZeXjai6fIt4JhHvrOACdHJ2i
ReEW/RmM9YOnV9y2UN1ncAA=
=z1nn
-----END PGP SIGNATURE-----