Hello,

I have been working on a fully redundant fail-over SAN for some time now and
almost have everything working the way it should.  However, there have been
some drawbacks.  I am using the most up-to-date versions of Heartbeat and
Pacemaker, and I have been modifying and testing everything through the CRM
CLI.  First off, I have not done much testing beyond putting each machine
into standby mode.  Here is the topology of the fail-over system:
http://www.nabble.com/file/p22460063/SAN.jpg 

And here is my configuration when I go into the CRM CLI: 

crm(live)configure# show

primitive R_IP_Target ocf:heartbeat:IPaddr2 \
        params ip="192.168.3.137" \
        params nic="eth0" \
        params iflabel="1" \
        op monitor interval="30s"
primitive R_tgtd ocf:acs:tgtd \
        op monitor interval="30s"
primitive R_IP_Init ocf:heartbeat:IPaddr2 \
        params ip="192.168.3.133" \
        params nic="eth0" \
        params iflabel="1" \
        op monitor interval="30s"
primitive R_iscsi ocf:heartbeat:iscsi \
        params target="target1.acsacc.com" \
        params portal="192.168.3.137" \
        op monitor interval="30s" \
        op start interval="0" timeout="60s"
primitive R_LVM ocf:heartbeat:LVM \
        params volgrpname="VolGroup01" \
        op monitor interval="30s" \
        op start interval="0" timeout="60s"
primitive R_Filesystem ocf:heartbeat:Filesystem \
        params device="/dev/VolGroup01/LogVol00" \
        params directory="/san_targets/www" \
        params fstype="ext3" \
        op monitor interval="30s" \
        op start interval="0" timeout="60s"
primitive R_NFS ocf:heartbeat:nfsserver \
        params nfs_init_script="/etc/init.d/nfs" \
        params nfs_notify_cmd="/sbin/rpc.statd" \
        params nfs_shared_infodir="/san_targets/www/nfsinfo" \
        params nfs_ip="192.168.3.133" \
        op monitor interval="30s"
primitive drbd0 ocf:heartbeat:drbd \
        params drbd_resource="drbd0" \
        op monitor interval="29s" role="Master" timeout="30s" \
        op monitor interval="30s" role="Slave" timeout="30s"
primitive drbd1 ocf:heartbeat:drbd \
        params drbd_resource="drbd1" \
        op monitor interval="29s" role="Master" timeout="30s" \
        op monitor interval="30s" role="Slave" timeout="30s"
primitive drbd2 ocf:heartbeat:drbd \
        params drbd_resource="drbd2" \
        op monitor interval="29s" role="Master" timeout="30s" \
        op monitor interval="30s" role="Slave" timeout="30s"
primitive R_pingd ocf:pacemaker:pingd
group G_Target R_IP_Target R_tgtd \
        meta target-role="Started"
group G_Init R_IP_Init R_iscsi R_LVM R_Filesystem R_NFS \
        meta target-role="Started"
ms ms-drbd0 drbd0 \
        meta clone-max="2" notify="true" globally-unique="false" \
        target-role="Started"
ms ms-drbd1 drbd1 \
        meta clone-max="2" notify="true" globally-unique="false" \
        target-role="Started"
ms ms-drbd2 drbd2 \
        meta clone-max="2" notify="true" globally-unique="false" \
        target-role="Started"
clone pingd R_pingd \
        meta target-role="Started"
location ms-drbd0-pref-1 ms-drbd0 200: san1.acsacc.com
location ms-drbd0-pref-2 ms-drbd0 100: san2.acsacc.com
location ms-drbd1-pref-1 ms-drbd1 200: san1.acsacc.com
location ms-drbd1-pref-2 ms-drbd1 100: san2.acsacc.com
location ms-drbd2-pref-1 ms-drbd2 200: san1.acsacc.com
location ms-drbd2-pref-2 ms-drbd2 100: san2.acsacc.com
location G_Target-pref-1 G_Target 200: san1.acsacc.com
location G_Target-pref-2 G_Target 100: san2.acsacc.com
location G_Init-pref-1 G_Init 200: init1.acsacc.com
location G_Init-pref-2 G_Init 100: init2.acsacc.com
location ms-drbd0-not-on-1 ms-drbd0 -inf: init1.acsacc.com
location ms-drbd0-not-on-2 ms-drbd0 -inf: init2.acsacc.com
location ms-drbd1-not-on-1 ms-drbd1 -inf: init1.acsacc.com
location ms-drbd1-not-on-2 ms-drbd1 -inf: init2.acsacc.com
location ms-drbd2-not-on-1 ms-drbd2 -inf: init1.acsacc.com
location ms-drbd2-not-on-2 ms-drbd2 -inf: init2.acsacc.com
location G_Target-not-on-1 G_Target -inf: init1.acsacc.com
location G_Target-not-on-2 G_Target -inf: init2.acsacc.com
location G_Init-not-on-1 G_Init -inf: san1.acsacc.com
location G_Init-not-on-2 G_Init -inf: san2.acsacc.com
location pingd-node-1 pingd 500: init1.acsacc.com
location pingd-node-2 pingd 500: init2.acsacc.com
location pingd-node-3 pingd 500: san1.acsacc.com
location pingd-node-4 pingd 500: san2.acsacc.com
property $id="cib-bootstrap-options" \
        dc-version="1.0.2-c02b459053bfa44d509a2a0e0247b291d93662b7" \
        stonith-enabled="false" \
        stonith-action="reboot" \
        stop-orphan-resources="true" \
        stop-orphan-actions="true" \
        symmetric-cluster="false" \
        last-lrm-refresh="1236720670"


I have three drbd devices that are set up to replicate between the two
targets (san1 & san2) and need to fail over quickly.  For the most part,
they do.  However, I think my constraints need some adjustment so that drbd
is promoted on the surviving machine and demoted on the machine that was
just placed into standby, and to fix a few more issues as well.  This is
what happens when I put each preferred machine into standby mode:

Init1:
-Switches over to init2 with no issues, flawless and quick
-When init1 is placed back into online mode, the resources begin to switch
back to init1, but fail while attempting to start the LVM (R_LVM) resource.
Resources then revert to init2.  I can get all the resources to switch back
to init1, but that requires placing init2 into standby mode and running a
cleanup of R_LVM on init1.  Even then it may not work and may require some
fixing elsewhere.
-After fixing the last issue by hand, I attempted to place init1 back into
standby mode to test again.  This time, R_LVM came back up with no issues,
but R_NFS failed and then all resources were placed back onto init2 as in
the first test.  After applying a cleanup to R_NFS, I noticed in crm_mon
that it tried to start on san1 and san2!  Looking at my constraints, I don't
see why it would do that.  I cannot seem to place all the resources back
onto init1 after this point, which usually means I would need to take the
whole system down to correct the situation.  That, obviously, cannot happen.
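
One thing I notice is that nothing in my configuration orders G_Init after
G_Target, so on a fail-over the initiator-side resources (R_iscsi, R_LVM,
and the rest of the group) could be trying to start before tgtd is actually
serving the LUNs again, which might explain the R_LVM start failures.  I was
imagining something like the following, though I'm not sure I have the
syntax exactly right:

order ord-target-before-init inf: G_Target:start G_Init:start

(The groups run on different nodes, so I'm assuming a plain order constraint
with no colocation is what's wanted here.)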

San1:
-If I place san1 into standby mode, everything fails.  The cluster attempts
to switch san2 to master for the drbd devices and san1 to slave, but fails,
which in turn stops the R_NFS, R_Filesystem, and R_LVM resources on the
initiator.
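
My guess is that G_Target needs to be explicitly colocated with, and ordered
after, the drbd masters -- right now only the location scores happen to put
them on the same nodes.  Something along these lines is what I had in mind
(again, I'm not certain this is correct):

colocation col-target-on-drbd0 inf: G_Target ms-drbd0:Master
colocation col-target-on-drbd1 inf: G_Target ms-drbd1:Master
colocation col-target-on-drbd2 inf: G_Target ms-drbd2:Master
order ord-drbd0-before-target inf: ms-drbd0:promote G_Target:start
order ord-drbd1-before-target inf: ms-drbd1:promote G_Target:start
order ord-drbd2-before-target inf: ms-drbd2:promote G_Target:start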

Are there some things I am missing in my configuration that would remedy
this?  I was thinking that a delay of some sort would need to be given to
each resource that is affected by the node change.  Unfortunately, I cannot
find any good documentation on how to do this in the CRM CLI.  Also, could
someone please take a look at my constraints?  I have a feeling that most of
my problems lie within the constraints, and if anything sticks out, it would
be great to know :-D
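
For the delay idea, the only operation attribute I have come across is
start-delay, e.g. on R_LVM (I'm not sure this is the right approach, or even
whether start-delay is meant to be used on a start operation):

primitive R_LVM ocf:heartbeat:LVM \
        params volgrpname="VolGroup01" \
        op monitor interval="30s" \
        op start interval="0" timeout="60s" start-delay="10s"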

Any help would be greatly appreciated!

Thanks!
-- 
View this message in context: 
http://www.nabble.com/Fencing-trouble%21--Need-some-help%21-tp22460063p22460063.html
Sent from the Linux-HA mailing list archive at Nabble.com.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
