We are using the following to create a 2-node highly available cluster:
Disk device - Fusion-io cards (PCIe SSDs)
DRBD/Corosync/Pacemaker
[r...@motest16 log]# rpm -qa | egrep "drbd|corosync|pacemaker"
drbd-pacemaker-8.3.7-1
drbd-8.3.7-1
drbd-bash-completion-8.3.7-1
drbd-xen-8.3.7-1
drbd-km-debuginfo-8.3.7-12
corosynclib-1.2.1-1.el5
drbd-utils-8.3.7-1
drbd-udev-8.3.7-1
drbd-km-2.6.18_164.15.1.0.1.el5-8.3.7-12
corosynclib-1.2.1-1.el5
pacemaker-1.0.8-6.el5
drbd-debuginfo-8.3.7-1
drbd-heartbeat-8.3.7-1
corosync-1.2.1-1.el5
pacemaker-libs-1.0.8-6.el5
[r...@motest16 log]# uname -r
2.6.18-164.15.1.0.1.el5
Terminology:
Pacemaker - Master/Slave
DRBD - Primary/Secondary
############################### TEST CASE #1 ###############################
OVERVIEW: Running dd from /dev/random while switching over DRBD/Pacemaker; the
switchover succeeds.
motest16 Master/Primary
motest17 Slave/Secondary
1) Run a dd test using /dev/random
2) Set motest16 to standby
3) Check the cluster status using crm_mon to ensure failover
4) Check df on motest17 to see that it mounted /fusion
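The four steps above can be sketched as a small shell script. This is a hedged sketch rather than the exact commands used for the test: the `crm node standby` invocation, the `run`/`run_bg` helpers, and the `DRY_RUN` switch are assumptions introduced for illustration. With `DRY_RUN=1` (the default here) the script only prints what it would execute.

```shell
#!/bin/sh
# Sketch of the switchover test. All helper names are hypothetical.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

# Run a command in the foreground, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Start a command in the background (the dd load must still be running
# when the node goes to standby), or just print it in dry-run mode.
run_bg() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $* &"
    else
        "$@" &
    fi
}

# switchover_test SOURCE NODE
#   SOURCE: input device for dd (/dev/random or /dev/zero)
#   NODE:   current master, to be put into standby
switchover_test() {
    src="$1"
    node="$2"
    run_bg dd "if=$src" of=/fusion/dd-test bs=1M count=100000  # step 1
    run crm node standby "$node"                               # step 2
    run crm_mon -1                                             # step 3
    run df -lh                                                 # step 4 (to be run on the peer node)
}
```

For example, `switchover_test /dev/random motest16.apple.com` prints the commands for this test case; the same function with `/dev/zero` and the other node name would cover the second test case.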
motest16:
[r...@motest16 log]# dd if=/dev/random of=/fusion/dd-test bs=1M count=100000
Terminated
[r...@motest16 log]#
motest17:
[r...@motest17 log]# df -lh | grep drbd
/dev/drbd1 301G 13G 288G 5% /fusion
crm_mon:
============
Last updated: Fri May 21 05:01:24 2010
Stack: openais
Current DC: motest16.apple.com - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
3 Resources configured.
============
Node motest16.apple.com: standby
Online: [ motest17.apple.com ]
FusionCluster (ocf::heartbeat:IPaddr2): Started motest17.apple.com
Master/Slave Set: FusionData
Masters: [ motest17.apple.com ]
Stopped: [ drbdFusion:1 ]
fsFusion (ocf::heartbeat:Filesystem): Started motest17.apple.com
############################### TEST CASE #2 ###############################
OVERVIEW: Running dd from /dev/zero while switching over DRBD/Pacemaker; the
switchover fails. Pacemaker does not switch the master/slave roles, which
suggests an issue in the corosync/pacemaker layer.
motest17 Master/Primary
motest16 Slave/Secondary
1) Run a dd test using /dev/zero
2) Set motest17 to standby
3) Check df on motest17 to see the bad output on /fusion
a. if you try to unmount /fusion it states "not mounted"
4) Check the cluster status using crm_mon to ensure failover
motest17:
[r...@motest17 log]# !1027
dd if=/dev/zero of=/fusion/dd-test2 bs=1M count=100000
Terminated
[r...@motest17 log]# mount | grep drbd
/dev/drbd1 on /fusion type xfs (rw)
[r...@motest17 log]# df -lh | grep drbd
/dev/drbd1 95G 2.9G 87G 4% /fusion
[r...@motest17 log]# umount /fusion/
umount: /dev/drbd1: not mounted
umount: /dev/drbd1: not mounted
[r...@motest17 log]# df -lh | grep drbd
[r...@motest17 log]#
motest16:
[r...@motest16 log]# df -lh | grep drbd
[r...@motest16 log]#
crm_mon:
============
Last updated: Fri May 21 05:21:57 2010
Stack: openais
Current DC: motest16.apple.com - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
3 Resources configured.
============
Node motest17.apple.com: standby
Online: [ motest16.apple.com ]
FusionCluster (ocf::heartbeat:IPaddr2): Started motest16.apple.com
Master/Slave Set: FusionData
Masters: [ motest17.apple.com ]
Slaves: [ motest16.apple.com ]
fsFusion (ocf::heartbeat:Filesystem): Started motest17.apple.com (unmanaged) FAILED
Failed actions:
fsFusion_stop_0 (node=motest17.apple.com, call=54, rc=-2, status=Timed Out): unknown exec error
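The "check the cluster status" step can be automated by scanning one-shot crm_mon output for the failure markers seen above. A minimal sketch, assuming the output is piped in on stdin; the function name `has_failed_actions` is hypothetical, and the two patterns are taken only from the output shown in this post:

```shell
#!/bin/sh
# Exit status 0 if the crm_mon text on stdin contains a "Failed actions:"
# section or a resource flagged FAILED; non-zero otherwise.
has_failed_actions() {
    grep -Eq '^Failed actions:|FAILED'
}
```

Typical use would be `crm_mon -1 | has_failed_actions && echo "failover did not complete cleanly"`.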
/var/log/messages:
May 21 05:18:19 motest17 lrmd: [24880]: info: rsc:fsFusion:54: stop
May 21 05:18:19 motest17 crmd: [24883]: info: do_lrm_rsc_op: Performing
key=52:39:0:5ab8262e-a01d-4de9-83bb-501625e3b973 op=drbdFusion:0_notify_0 )
May 21 05:18:19 motest17 lrmd: [24880]: info: rsc:drbdFusion:0:55: notify
May 21 05:18:19 motest17 Filesystem[22120]: INFO: Running stop for /dev/drbd1
on /fusion
May 21 05:18:19 motest17 lrmd: [24880]: info: Managed drbdFusion:0:notify
process 22121 exited with return code 0.
May 21 05:18:19 motest17 crmd: [24883]: info: process_lrm_event: LRM operation
drbdFusion:0_notify_0 (call=55, rc=0, cib-update=69, confirmed=true) ok
May 21 05:18:19 motest17 Filesystem[22120]: INFO: Trying to unmount /fusion
May 21 05:18:19 motest17 lrmd: [24880]: info: RA output: (fsFusion:stop:stderr)
umount: /fusion: device is busy umount: /fusion: device is busy
May 21 05:18:19 motest17 Filesystem[22120]: ERROR: Couldn't unmount /fusion;
trying cleanup with SIGTERM
May 21 05:18:19 motest17 lrmd: [24880]: info: RA output: (fsFusion:stop:stderr)
/fusion:
May 21 05:18:19 motest17 lrmd: [24880]: info: RA output: (fsFusion:stop:stdout)
21535
May 21 05:18:19 motest17 Filesystem[22120]: INFO: Some processes on /fusion
were signalled
May 21 05:18:39 motest17 lrmd: [24880]: WARN: fsFusion:stop process (PID 22120)
timed out (try 1). Killing with signal SIGTERM (15).
May 21 05:18:39 motest17 lrmd: [24880]: WARN: Managed fsFusion:stop process
22120 killed by signal 15 [SIGTERM - Termination (ANSI)].
May 21 05:18:39 motest17 lrmd: [24880]: WARN: operation stop[54] on
ocf::Filesystem::fsFusion for client 24883, its parameters: directory=[/fusion]
fstype=[xfs] device=[/dev/drbd1] CRM_meta_timeout=[20000]
crm_feature_set=[3.0.1] : pid [22120] timed out
May 21 05:18:39 motest17 crmd: [24883]: ERROR: process_lrm_event: LRM operation
fsFusion_stop_0 (54) Timed Out (timeout=20000ms)
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_ais_dispatch: Update
relayed from motest16.apple.com
May 21 05:18:39 motest17 attrd: [24881]: info: find_hash_entry: Creating hash
entry for fail-count-fsFusion
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-fsFusion (INFINITY)
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_perform_update: Sent
update 76: fail-count-fsFusion=INFINITY
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_ais_dispatch: Update
relayed from motest16.apple.com
May 21 05:18:39 motest17 attrd: [24881]: info: find_hash_entry: Creating hash
entry for last-failure-fsFusion
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-fsFusion (1274444342)
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_perform_update: Sent
update 79: last-failure-fsFusion=1274444342
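The lrmd warning in the log above records the timeout the stop operation actually ran with (CRM_meta_timeout, in milliseconds), which is worth comparing with the op timeouts in the cluster configuration. A minimal sketch for pulling that value out of the log, assuming each entry occupies a single line in /var/log/messages (the wrapping above comes from the mail); the function name `effective_timeouts` is hypothetical:

```shell
#!/bin/sh
# effective_timeouts RESOURCE
# Reads syslog text on stdin and prints the CRM_meta_timeout value (ms)
# from every lrmd "operation ... timed out" entry for the given resource.
effective_timeouts() {
    rsc="$1"
    grep "::$rsc for client" |
        sed -n 's/.*CRM_meta_timeout=\[\([0-9][0-9]*\)].*/\1/p'
}
```

Typical use would be `effective_timeouts fsFusion < /var/log/messages`.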
############################### cluster resources ###########################
[r...@motest16 ~]# ccs
node motest16.apple.com \
attributes standby="off"
node motest17.apple.com \
attributes standby="off"
primitive FusionCluster ocf:heartbeat:IPaddr2 \
params ip="17.209.103.248" cidr_netmask="24" \
op monitor interval="30s"
primitive drbdFusion ocf:linbit:drbd \
params drbd_resource="fusion" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive fsFusion ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/fusion" fstype="xfs" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
ms FusionData drbdFusion \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation fs_on_drbd inf: fsFusion FusionData:Master
order fsFusion-after-FusionData inf: FusionData:promote fsFusion:start
property $id="cib-bootstrap-options" \
dc-version="1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore"
############################### corosync.conf ###############################
totem {
version: 2
token: 3000
token_retransmits_before_loss_const: 10
join: 60
consensus: 3600
vsftype: none
max_messages: 20
clear_node_high_bit: yes
secauth: on
threads: 0
rrp_mode: passive
interface {
ringnumber: 0
bindnetaddr: 17.209.103.0
mcastaddr: 239.94.1.1
mcastport: 5405
}
}
logging {
to_stderr: yes
debug: on
timestamp: on
to_logfile: yes
to_syslog: yes
syslog_facility: daemon
logfile: /var/log/corosync.log
/var/log/corosync.log {
missingok
compress
notifempty
daily
rotate 7
copytruncate
}
}
amf {
mode: disabled
}
service {
ver: 0
name: pacemaker
}
corosync {
user: root
group: root
}
aisexec {
user: root
group: root
}
############################### drbd.conf ###############################
# You can find an example in /usr/share/doc/drbd.../drbd.conf.example
#include "drbd.d/global_common.conf";
#include "drbd.d/*.res";
global {
usage-count yes;
}
resource fusion {
device /dev/drbd1;
disk /dev/fioa;
meta-disk internal;
protocol C;
syncer {
rate 1G;
verify-alg sha1;
}
net {
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri call-pri-lost-after-sb;
}
handlers {
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
split-brain "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
}
on motest16.apple.com {
address 17.209.103.135:7789;
}
on motest17.apple.com {
address 17.209.103.136:7789;
}
}