We are using the following to create a 2-node highly-available cluster:

Disk device - Fusion-io cards (PCIe SSDs)
DRBD/Corosync/Pacemaker

[r...@motest16 log]# rpm -qa | egrep "drbd|corosync|pacemaker"
drbd-pacemaker-8.3.7-1
drbd-8.3.7-1
drbd-bash-completion-8.3.7-1
drbd-xen-8.3.7-1
drbd-km-debuginfo-8.3.7-12
corosynclib-1.2.1-1.el5
drbd-utils-8.3.7-1
drbd-udev-8.3.7-1
drbd-km-2.6.18_164.15.1.0.1.el5-8.3.7-12
corosynclib-1.2.1-1.el5
pacemaker-1.0.8-6.el5
drbd-debuginfo-8.3.7-1
drbd-heartbeat-8.3.7-1
corosync-1.2.1-1.el5
pacemaker-libs-1.0.8-6.el5

[r...@motest16 log]# uname -r
2.6.18-164.15.1.0.1.el5

Terminology:
Pacemaker - Master/Slave
DRBD      - Primary/Secondary

############################### TEST CASE #1 ###############################
OVERVIEW: Using dd from /dev/random to test the switchover of drbd/pacemaker;
the switchover succeeds.
motest16 Master/Primary
motest17 Slave/Secondary

1) Run a dd test using /dev/random
2) Set motest16 to standby
3) Check the cluster status using crm_mon to ensure failover
4) Check df on motest17 to see that it mounted /fusion
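
The steps above can be scripted roughly as follows. This is a minimal sketch,
not the exact commands used in the test: the helper names run,
standby_and_status and check_mount and the DRY_RUN switch are hypothetical, and
it assumes the crm shell and crm_mon are on the PATH. The dd load from step 1 is
assumed to be already running in another terminal.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default here) prints the commands instead of
# running them; set DRY_RUN=0 on a real cluster node to execute them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# standby_and_status NODE  (hypothetical helper)
# Step 2: put NODE in standby. Step 3: print a one-shot cluster status.
standby_and_status() {
    run crm node standby "$1"
    run crm_mon -1
}

# check_mount MOUNTPOINT  (hypothetical helper)
# Step 4: meant to be run on the node that should have taken over.
check_mount() {
    run df -lh "$1"
}
```

For this test case that would be "standby_and_status motest16.apple.com" on
either node, followed by "check_mount /fusion" on motest17.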

motest16:
[r...@motest16 log]# dd if=/dev/random of=/fusion/dd-test bs=1M count=100000
Terminated
[r...@motest16 log]#

motest17:
[r...@motest17 log]# df -lh | grep drbd
/dev/drbd1            301G   13G  288G   5% /fusion

crm_mon:
============
Last updated: Fri May 21 05:01:24 2010
Stack: openais
Current DC: motest16.apple.com - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
3 Resources configured.
============

Node motest16.apple.com: standby
Online: [ motest17.apple.com ]

FusionCluster   (ocf::heartbeat:IPaddr2):       Started motest17.apple.com
 Master/Slave Set: FusionData
     Masters: [ motest17.apple.com ]
     Stopped: [ drbdFusion:1 ]
fsFusion        (ocf::heartbeat:Filesystem):    Started motest17.apple.com



############################### TEST CASE #2 ###############################
OVERVIEW: Using dd from /dev/zero to test the switchover of drbd/pacemaker;
the switchover fails. Pacemaker does not switch over the master/slave roles,
indicating an issue with the corosync/pacemaker layer.
motest17 Master/Primary
motest16 Slave/Secondary

1) Run a dd test using /dev/zero
2) Set motest17 to standby
3) Check df on motest17 to see the bad output on /fusion 
        a. if you try to unmount /fusion it states "not mounted"
4) Check the cluster status using crm_mon to ensure failover
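
For step 3 it can help to compare what df and mount report with the kernel's
own mount table. On el5, mount and df typically read /etc/mtab, which can go
stale, while /proc/mounts reflects what the kernel actually has mounted.
Whether a stale /etc/mtab explains the output below is an assumption, not
something this test confirms. The helper name is_mounted is hypothetical.

```shell
#!/bin/sh
# is_mounted MOUNTPOINT [TABLE]  (hypothetical helper)
# Succeeds if MOUNTPOINT appears as a mount point in TABLE, which defaults to
# the kernel's view in /proc/mounts. Pass /etc/mtab to check the userspace copy.
is_mounted() {
    awk -v m="$1" '$2 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}
```

Running "is_mounted /fusion" and "is_mounted /fusion /etc/mtab" on motest17
and comparing the two exit codes would show whether the two tables disagree.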

motest17:
[r...@motest17 log]# !1027
dd if=/dev/zero of=/fusion/dd-test2 bs=1M count=100000
Terminated
[r...@motest17 log]# mount | grep drbd
/dev/drbd1 on /fusion type xfs (rw)
[r...@motest17 log]# df -lh | grep drbd
/dev/drbd1             95G  2.9G   87G   4% /fusion
[r...@motest17 log]# umount /fusion/
umount: /dev/drbd1: not mounted
umount: /dev/drbd1: not mounted
[r...@motest17 log]# df -lh | grep drbd
[r...@motest17 log]#

motest16:
[r...@motest16 log]# df -lh | grep drbd
[r...@motest16 log]#

crm_mon:
============
Last updated: Fri May 21 05:21:57 2010
Stack: openais
Current DC: motest16.apple.com - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
3 Resources configured.
============

Node motest17.apple.com: standby
Online: [ motest16.apple.com ]

FusionCluster   (ocf::heartbeat:IPaddr2):       Started motest16.apple.com
 Master/Slave Set: FusionData
     Masters: [ motest17.apple.com ]
     Slaves: [ motest16.apple.com ]
fsFusion        (ocf::heartbeat:Filesystem):    Started motest17.apple.com (unmanaged) FAILED

Failed actions:
    fsFusion_stop_0 (node=motest17.apple.com, call=54, rc=-2, status=Timed Out): unknown exec error

 
/var/log/messages:
May 21 05:18:19 motest17 lrmd: [24880]: info: rsc:fsFusion:54: stop
May 21 05:18:19 motest17 crmd: [24883]: info: do_lrm_rsc_op: Performing 
key=52:39:0:5ab8262e-a01d-4de9-83bb-501625e3b973 op=drbdFusion:0_notify_0 )
May 21 05:18:19 motest17 lrmd: [24880]: info: rsc:drbdFusion:0:55: notify
May 21 05:18:19 motest17 Filesystem[22120]: INFO: Running stop for /dev/drbd1 
on /fusion
May 21 05:18:19 motest17 lrmd: [24880]: info: Managed drbdFusion:0:notify 
process 22121 exited with return code 0.
May 21 05:18:19 motest17 crmd: [24883]: info: process_lrm_event: LRM operation 
drbdFusion:0_notify_0 (call=55, rc=0, cib-update=69, confirmed=true) ok
May 21 05:18:19 motest17 Filesystem[22120]: INFO: Trying to unmount /fusion
May 21 05:18:19 motest17 lrmd: [24880]: info: RA output: (fsFusion:stop:stderr) 
umount: /fusion: device is busy umount: /fusion: device is busy 
May 21 05:18:19 motest17 Filesystem[22120]: ERROR: Couldn't unmount /fusion; 
trying cleanup with SIGTERM
May 21 05:18:19 motest17 lrmd: [24880]: info: RA output: (fsFusion:stop:stderr) 
/fusion:             
May 21 05:18:19 motest17 lrmd: [24880]: info: RA output: (fsFusion:stop:stdout) 
 21535
May 21 05:18:19 motest17 Filesystem[22120]: INFO: Some processes on /fusion 
were signalled
May 21 05:18:39 motest17 lrmd: [24880]: WARN: fsFusion:stop process (PID 22120) 
timed out (try 1).  Killing with signal SIGTERM (15).
May 21 05:18:39 motest17 lrmd: [24880]: WARN: Managed fsFusion:stop process 
22120 killed by signal 15 [SIGTERM - Termination (ANSI)].
May 21 05:18:39 motest17 lrmd: [24880]: WARN: operation stop[54] on 
ocf::Filesystem::fsFusion for client 24883, its parameters: directory=[/fusion] 
fstype=[xfs] device=[/dev/drbd1] CRM_meta_timeout=[20000] 
crm_feature_set=[3.0.1] : pid [22120] timed out
May 21 05:18:39 motest17 crmd: [24883]: ERROR: process_lrm_event: LRM operation 
fsFusion_stop_0 (54) Timed Out (timeout=20000ms)
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_ais_dispatch: Update 
relayed from motest16.apple.com
May 21 05:18:39 motest17 attrd: [24881]: info: find_hash_entry: Creating hash 
entry for fail-count-fsFusion
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-fsFusion (INFINITY)
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_perform_update: Sent 
update 76: fail-count-fsFusion=INFINITY
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_ais_dispatch: Update 
relayed from motest16.apple.com
May 21 05:18:39 motest17 attrd: [24881]: info: find_hash_entry: Creating hash 
entry for last-failure-fsFusion
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-fsFusion (1274444342)
May 21 05:18:39 motest17 attrd: [24881]: info: attrd_perform_update: Sent 
update 79: last-failure-fsFusion=1274444342
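
To pull the timed-out operations out of a long /var/log/messages, a small
filter like the one below can be used. It is a sketch: the function name
timed_out_ops is hypothetical, and the pattern assumes the crmd message format
shown in the excerpt above, with each log entry on a single line (the excerpt
above is line-wrapped by the mail client).

```shell
#!/bin/sh
# timed_out_ops  (hypothetical helper)
# Reads syslog text on stdin and prints "OPERATION TIMEOUT_MS" for every
# "LRM operation ... Timed Out (timeout=...ms)" message logged by crmd.
timed_out_ops() {
    sed -n 's/.*LRM operation \([^ ]*\) ([0-9]*) Timed Out (timeout=\([0-9]*\)ms).*/\1 \2/p'
}
```

Usage would be "timed_out_ops < /var/log/messages". For the crmd ERROR entry
above it prints "fsFusion_stop_0 20000".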

############################### cluster resources ###########################

[r...@motest16 ~]# ccs
node motest16.apple.com \
        attributes standby="off"
node motest17.apple.com \
        attributes standby="off"
primitive FusionCluster ocf:heartbeat:IPaddr2 \
        params ip="17.209.103.248" cidr_netmask="24" \
        op monitor interval="30s"
primitive drbdFusion ocf:linbit:drbd \
        params drbd_resource="fusion" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive fsFusion ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/fusion" fstype="xfs" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
ms FusionData drbdFusion \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true"
colocation fs_on_drbd inf: fsFusion FusionData:Master
order fsFusion-after-FusionData inf: FusionData:promote fsFusion:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"


############################### corosync.conf ###############################
totem {
  version: 2
  token: 3000
  token_retransmits_before_loss_const: 10
  join: 60
  consensus: 3600 
  vsftype: none
  max_messages: 20
  clear_node_high_bit: yes
  secauth: on
  threads: 0
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 17.209.103.0
    mcastaddr: 239.94.1.1
    mcastport: 5405
  }
}

logging {
  to_stderr: yes
  debug: on
  timestamp: on
  to_logfile: yes 
  to_syslog: yes
  syslog_facility: daemon
  logfile: /var/log/corosync.log
        /var/log/corosync.log {
                  missingok
                  compress
                  notifempty
                  daily
                  rotate 7
                  copytruncate
        }
}

amf {
  mode: disabled
}

service {
  ver:       0
  name:      pacemaker
}

corosync {
  user:   root
  group:  root
}

aisexec {
  user:   root
  group:  root
}

############################### drbd.conf ###############################
# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example

#include "drbd.d/global_common.conf";
#include "drbd.d/*.res";

global {
  usage-count yes;
}

resource fusion {
  device    /dev/drbd1;
  disk      /dev/fioa;
  meta-disk internal;
  protocol C;

  syncer {
    rate 1G;
    verify-alg sha1;
  }
  
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri call-pri-lost-after-sb;
  }

  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    split-brain "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
  }

  on motest16.apple.com {
    address   17.209.103.135:7789;
  }

  on motest17.apple.com {
    address   17.209.103.136:7789;
  }
}