On Wed, March 11, 2009 18:19, Ethan Bannister wrote:
>

> Hello,
>
>
> I have been working on a complete fail-over SAN for some time now and
> almost have everything working the way it should.  However, there have
> been some drawbacks.  I am using the most up to date version of Heartbeat
> and Pacemaker.  I have been modifying and testing everything through the
> CRM
> CLI.  First off, I have not done much testing past putting each machine
> into standby mode.  Here is the topology of the fail-over system:
> http://www.nabble.com/file/p22460063/SAN.jpg
>
>
> And here is my configuration when I go into the CRM CLI:
>
>
> crm(live)configure# show
>
> primitive R_IP_Target ocf:heartbeat:IPaddr2 \
>     params ip="192.168.3.137" \
>     params nic="eth0" \
>     params iflabel="1" \
>     op monitor interval="30s"
> primitive R_tgtd ocf:acs:tgtd \
>     op monitor interval="30s"
> primitive R_IP_Init ocf:heartbeat:IPaddr2 \
>     params ip="192.168.3.133" \
>     params nic="eth0" \
>     params iflabel="1" \
>     op monitor interval="30s"
> primitive R_iscsi ocf:heartbeat:iscsi \
>     params target="target1.acsacc.com" \
>     params portal="192.168.3.137" \
>     op monitor interval="30s" \
>     op start interval="0" timeout="60s"
> primitive R_LVM ocf:heartbeat:LVM \
>     params volgrpname="VolGroup01" \
>     op monitor interval="30s" \
>     op start interval="0" timeout="60s"
> primitive R_Filesystem ocf:heartbeat:Filesystem \
>     params device="/dev/VolGroup01/LogVol00" \
>     params directory="/san_targets/www" \
>     params fstype="ext3" \
>     op monitor interval="30s" \
>     op start interval="0" timeout="60s"
> primitive R_NFS ocf:heartbeat:nfsserver \
>     params nfs_init_script="/etc/init.d/nfs" \
>     params nfs_notify_cmd="/sbin/rpc.statd" \
>     params nfs_shared_infodir="/san_targets/www/nfsinfo" \
>     params nfs_ip="192.168.3.133" \
>     op monitor interval="30s"
> primitive drbd0 ocf:heartbeat:drbd \
>     params drbd_resource="drbd0" \
>     op monitor interval="29s" role="Master" timeout="30s" \
>     op monitor interval="30s" role="Slave" timeout="30s"
> primitive drbd1 ocf:heartbeat:drbd \
>     params drbd_resource="drbd1" \
>     op monitor interval="29s" role="Master" timeout="30s" \
>     op monitor interval="30s" role="Slave" timeout="30s"
> primitive drbd2 ocf:heartbeat:drbd \
>     params drbd_resource="drbd2" \
>     op monitor interval="29s" role="Master" timeout="30s" \
>     op monitor interval="30s" role="Slave" timeout="30s"
> primitive R_pingd ocf:pacemaker:pingd
> group G_Target R_IP_Target R_tgtd \
>     meta target-role="Started"
> group G_Init R_IP_Init R_iscsi R_LVM R_Filesystem R_NFS \
>     meta target-role="Started"
> ms ms-drbd0 drbd0 \
>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> ms ms-drbd1 drbd1 \
>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> ms ms-drbd2 drbd2 \
>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> clone pingd R_pingd \
>     meta target-role="Started"
> location ms-drbd0-pref-1 ms-drbd0 200: san1.acsacc.com
> location ms-drbd0-pref-2 ms-drbd0 100: san2.acsacc.com
> location ms-drbd1-pref-1 ms-drbd1 200: san1.acsacc.com
> location ms-drbd1-pref-2 ms-drbd1 100: san2.acsacc.com
> location ms-drbd2-pref-1 ms-drbd2 200: san1.acsacc.com
> location ms-drbd2-pref-2 ms-drbd2 100: san2.acsacc.com
> location G_Target-pref-1 G_Target 200: san1.acsacc.com
> location G_Target-pref-2 G_Target 100: san2.acsacc.com
> location G_Init-pref-1 G_Init 200: init1.acsacc.com
> location G_Init-pref-2 G_Init 100: init2.acsacc.com
> location ms-drbd0-not-on-1 ms-drbd0 -inf: init1.acsacc.com
> location ms-drbd0-not-on-2 ms-drbd0 -inf: init2.acsacc.com
> location ms-drbd1-not-on-1 ms-drbd1 -inf: init1.acsacc.com
> location ms-drbd1-not-on-2 ms-drbd1 -inf: init2.acsacc.com
> location ms-drbd2-not-on-1 ms-drbd2 -inf: init1.acsacc.com
> location ms-drbd2-not-on-2 ms-drbd2 -inf: init2.acsacc.com
> location G_Target-not-on-1 G_Target -inf: init1.acsacc.com
> location G_Target-not-on-2 G_Target -inf: init2.acsacc.com
> location G_Init-not-on-1 G_Init -inf: san1.acsacc.com
> location G_Init-not-on-2 G_Init -inf: san2.acsacc.com
> location pingd-node-1 pingd 500: init1.acsacc.com
> location pingd-node-2 pingd 500: init2.acsacc.com
> location pingd-node-3 pingd 500: san1.acsacc.com
> location pingd-node-4 pingd 500: san2.acsacc.com
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.2-c02b459053bfa44d509a2a0e0247b291d93662b7" \
>     stonith-enabled="false" \
>     stonith-action="reboot" \
>     stop-orphan-resources="true" \
>     stop-orphan-actions="true" \
>     symmetric-cluster="false" \
>     last-lrm-refresh="1236720670"
>
>
> I have three DRBD devices that replicate between the two targets (san1 &
> san2) and need to fail over quickly.  For the most part, they do.
> However, I think my constraints need some adjustment so that DRBD is
> promoted on the other machine and demoted on the machine that was just
> placed into standby, and to fix a few other issues as well.  This is
> what happens when I put each preferred machine into standby mode:
>
> Init1:
> -Switches over to init2 with no issues, flawless and quick.
> -When init1 is placed back into online mode, the resources begin to switch
>  back to init1, but fail while attempting to start the LVM (R_LVM)
>  resource.  Resources then revert back to init2.  I can get all the
>  resources to switch back over to init1, but that requires init2 to be
>  placed into standby mode and a cleanup of R_LVM on init1.  And even that
>  may not work and may require some fixing elsewhere.
> -After fixing the last issue by hand, I attempted to place init1 back
>  into standby mode to test again.  This time, R_LVM came back up with no
>  issues, but R_NFS failed and then all resources were placed back onto
>  init2, as in the first test.  After applying a cleanup to R_NFS, I
>  noticed in crm_mon that it tries to start on san1 and san2!  Looking at
>  my constraints, I don't see why it would do that.  I cannot seem to
>  place all the resources back onto init1 after this point.  This usually
>  means I would need to take the whole system down to correct the
>  situation, which obviously cannot happen.
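>
> (For reference, the standby/online transitions and the resource cleanups
> described above were done with crm shell commands along these lines:)
>
>     crm node standby init1.acsacc.com
>     crm node online init1.acsacc.com
>     crm resource cleanup R_LVM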
>
> San1:
> -If I place san1 into standby mode, everything fails.  It attempts to
> switch san2 to master for the drbd devices, and san1 to slave, but fails,
> thus also stopping the R_NFS, R_Filesystem and R_LVM resources on the
> initiator.
>
> Are there some things missing from my configuration that would remedy
> this?  I was thinking that a delay of some sort would need to be given
> for each resource that is affected by the node change.  Unfortunately, I
> cannot find any good documentation on how to do this in the CRM CLI.
> Also, could someone please take a look at my constraints?  I have a
> feeling that most of my problems lie within the constraints, and if
> anything sticks out, it would be great to know :-D
>
> Any help would be greatly appreciated!

Have a look at http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 ... watch
out for (role) Master and (action) promote
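For example (a sketch only, using the resource names from the configuration
above), the SAN-side group would typically be tied to the DRBD master with a
colocation on the Master role and ordered after the promote action:

    colocation G_Target-with-drbd0-master inf: G_Target ms-drbd0:Master
    order G_Target-after-drbd0-promote inf: ms-drbd0:promote G_Target:start

(and similarly for ms-drbd1 and ms-drbd2). Without constraints of this kind,
Pacemaker has no reason to promote DRBD on the surviving node before it
tries to start the group there.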

Regards,
Andreas

-- 
: Andreas Kurz
: LINBIT | Your Way to High Availability
: Tel +43-1-8178292-64, Fax +43-1-8178292-82
:
: http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
