Re: [Linux-HA] DRBD does not switch resources to other node properly

Dejan Muhamedagic Fri, 17 Apr 2009 04:53:13 -0700

Hi,

On Thu, Apr 16, 2009 at 01:15:55PM -0700, Ethan Bannister wrote:
> 
> /var/log/messages on san2 states that it couldn't promote drbd1:1 on san2
> because san1 was still in primary mode.  This makes sense.  But why would it
> have no issues with taking down the other drbd devices on san1 and not
> drbd1?  Is there a log file that may give me a better idea of what may be
> going on?  I am assuming that when I pull the cable or take down eth0, the
> rest of the cluster is unable to tell san1 to demote the drbd devices so
> that san2 can then promote them.  But from what I gather from this log file,
> drbdadm does all of this.  So would it be safe to assume that drbdadm
> communicates through the direct link between the two targets and it is
> failing for drbd1 for some reason?


AFAIK, drbd is using just one link. If that cable is pulled, then
you have a drbd split brain. BTW, you may want to take a look at
dopd to have heartbeat help drbd in this case.
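
For reference, a rough sketch of what enabling dopd (the DRBD outdate-peer
daemon) typically involves. It assumes Heartbeat's helpers live under
/usr/lib/heartbeat (some distributions use /usr/lib64/heartbeat) and uses
drbd1 from the configuration quoted below as the example resource. The
5-second timeout is an illustrative value, and the handler keyword depends
on the DRBD version, so check all of it against the documentation for the
installed release:

```
# /etc/ha.d/ha.cf on both DRBD nodes: start dopd and let it use the Heartbeat API
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

# /etc/drbd.conf: resource-level fencing for one resource (assumed name: drbd1)
resource drbd1 {
    handlers {
        # named "outdate-peer" in DRBD 8.0-8.2, "fence-peer" from 8.3 on
        fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }
    disk {
        fencing resource-only;
    }
    # ... rest of the existing resource definition unchanged ...
}
```

Changes like these would normally be picked up with a Heartbeat reload and
"drbdadm adjust drbd1". Note that dopd can only outdate the peer if Heartbeat
still has a communication path to it that does not depend on the failed link.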

Thanks,

Dejan

> This is puzzling me.  I know that I am
> missing something that is right under my nose.
> 
> Apr 16 14:04:39 san2 lrmd: [12984]: info: rsc:drbd1:1: promote
> Apr 16 14:04:40 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd0:1_monitor_29000 (call=107, rc=8, cib-update=133, confirmed=false) complete master
> Apr 16 14:04:41 san2 lrmd: [12984]: info: RA output: (drbd1:1:promote:stdout) /dev/drbd1: State change failed: (-1) Multiple primaries not allowed by config Command 'drbdsetup /dev/drbd1 primary' terminated with exit code 11
> Apr 16 14:04:41 san2 drbd[6372]: [6459]: ERROR: drbd1 promote: Not primary despite drbdadm call.
> Apr 16 14:04:41 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_promote_0 (call=108, rc=1, cib-update=134, confirmed=true) complete unknown error
> Apr 16 14:04:41 san2 kernel: drbd1: peer( Primary -> Secondary )
> Apr 16 14:04:42 san2 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
> Apr 16 14:04:42 san2 kernel: drbd1: Writing meta data super block now.
> Apr 16 14:04:42 san2 kernel: drbd1: asender terminated
> Apr 16 14:04:42 san2 kernel: drbd1: Terminating asender thread
> Apr 16 14:04:42 san2 kernel: drbd1: tl_clear()
> Apr 16 14:04:42 san2 kernel: drbd1: Connection closed
> Apr 16 14:04:42 san2 kernel: drbd1: conn( TearDown -> Unconnected )
> Apr 16 14:04:42 san2 kernel: drbd1: receiver terminated
> Apr 16 14:04:42 san2 kernel: drbd1: Restarting receiver thread
> Apr 16 14:04:42 san2 kernel: drbd1: receiver (re)started
> Apr 16 14:04:42 san2 kernel: drbd1: conn( Unconnected -> WFConnection )
> Apr 16 14:04:42 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=174:43:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
> Apr 16 14:04:42 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
> Apr 16 14:04:42 san2 crm_master: [6492]: info: Invoked: /usr/sbin/crm_master -l reboot -v 10
> Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd1:1
> Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_perform_update: Sent update 118: master-drbd1:1=10
> Apr 16 14:04:42 san2 lrmd: [12984]: info: RA output: (drbd1:1:notify:stdout) 0 Trying master-drbd1:1=10 update via attrd
> Apr 16 14:04:42 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_notify_0 (call=109, rc=0, cib-update=135, confirmed=true) complete ok
> Apr 16 14:04:43 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=170:44:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
> Apr 16 14:04:43 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
> 
> 
> 
> Dejan Muhamedagic wrote:
> > 
> > Hi,
> > 
> > On Thu, Apr 16, 2009 at 10:11:26AM -0700, Ethan Bannister wrote:
> >> 
> >> Perhaps someone can give me a little insight into what I may be doing
> >> wrong.  I would like DRBD to promote on the secondary machine when the
> >> Ethernet connection to the initiator on my SAN goes down.  When I pull
> >> the cable or bring down eth0, which the IPaddr resource resides on,
> >> this is what crm_mon shows me soon after:
> >> 
> >> ============
> >> Last updated: Thu Apr 16 12:38:36 2009
> >> Current DC: init2.mydomain.com (1d3814dc-7928-4beb-99f6-c7ade09056a5) - partition with quorum
> >> Version: 1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9
> >> 4 Nodes configured, unknown expected votes
> >> 8 Resources configured.
> >> ============
> >> 
> >> Online: [ san2.mydomain.com init2.mydomain.com init1.mydomain.com ]
> >> OFFLINE: [ san1.mydomain.com ]
> >> 
> >> Resource Group: G_Target
> >>     R_IP_Target (ocf::heartbeat:IPaddr2):  Started san2.mydomain.com
> >>     R_tgtd (ocf::acs:tgtdra):      Started san2.mydomain.com
> >> Master/Slave Set: ms-drbd0
> >>         Masters: [ san2.mydomain.com ]
> >>         Stopped: [ drbd0:0 ]     <---------- correct
> >> Master/Slave Set: ms-drbd1
> >>         Masters: [ san2.mydomain.com ]
> >>         Stopped: [ drbd1:1 ]     <---------- incorrect
> >> Master/Slave Set: ms-drbd2
> >>         Masters: [ san2.mydomain.com ]
> >>         Stopped: [ drbd2:0 ]     <---------- correct
> >> Clone Set: pingd
> >>         Started: [ init1.mydomain.com init2.mydomain.com san2.mydomain.com ]
> >>         Stopped: [ R_pingd:2 ]
> >> 
> >> Failed actions:
> >>     drbd1:1_promote_0 (node=san2.mydomain.com, call=43, rc=1, status=complete): unknown error
> > 
> > Does drbd report any error in the logs (look for lrmd.*drbd)?
> > This looks like a resource or a drbd RA issue.
> > 
> > Thanks,
> > 
> > Dejan
> > 
> >> As you can see, drbd0 and drbd2 promote with no issues.  But drbd1 is
> >> not promoting properly.  I have checked my constraints, and I have
> >> tweaked the start-delay settings, but nothing happens the way I would
> >> like.  I have two initiators for redundancy as well.  But I want the
> >> initiator to stay up if the network goes down on either target.  This
> >> has been puzzling me for some time now.  Any help would be greatly
> >> appreciated.
> > 
> >> Here is what I have for a crm cli config:
> >> 
> >> node $id="cee46f54-d517-4e4d-b0b8-3076fbc5751b" san2.mydomain.com \
> >>         attributes standby="off"
> >> node $id="bde24914-1235-4dc4-8686-f05fd9e6a35e" san1.mydomain.com \
> >>         attributes standby="off"
> >> node $id="1d3814dc-7928-4beb-99f6-c7ade09056a5" init2.mydomain.com \
> >>         attributes standby="off"
> >> node $id="a058cd72-b27e-4593-ac7e-d79db0709c15" init1.mydomain.com \
> >>         attributes standby="off"
> >> primitive R_IP_Target ocf:heartbeat:IPaddr2 \
> >>         params ip="192.168.*.*" \
> >>         params nic="eth0" \
> >>         params iflabel="1" \
> >>         op monitor interval="30s"
> >> primitive R_tgtd ocf:acs:tgtdra \
> >>         op monitor interval="30s" \
> >>         op start interval="0" timeout="30s" start-delay="2s"
> >> primitive R_IP_Init ocf:heartbeat:IPaddr2 \
> >>         params ip="192.168.*.*" \
> >>         params nic="eth0" \
> >>         params iflabel="1" \
> >>         op monitor interval="30s"
> >> primitive R_iscsi ocf:heartbeat:iscsi \
> >>         params target="target1.mydomain.com:san.targets" \
> >>         params portal="192.168.*.*" \
> >>         op monitor interval="30s" \
> >>         op start interval="0" timeout="30s" start-delay="5s" \
> >>         meta is-managed="true"
> >> primitive R_LVM ocf:heartbeat:LVM \
> >>         params volgrpname="VolGroup01" \
> >>         op monitor interval="30s" \
> >>         op start interval="0" timeout="30s" start-delay="5s" \
> >>         meta is-managed="true"
> >> primitive R_Filesystem ocf:heartbeat:Filesystem \
> >>         params device="/dev/VolGroup01/LogVol00" \
> >>         params directory="/san_targets/www" \
> >>         params fstype="ext3" \
> >>         op monitor interval="30s" \
> >>         op start interval="0" timeout="30s" start-delay="5s"
> >> primitive R_NFS ocf:heartbeat:nfsserver \
> >>         params nfs_init_script="/etc/init.d/nfs" \
> >>         params nfs_notify_cmd="/sbin/rpc.statd" \
> >>         params nfs_shared_infodir="/san_targets/www/nfsinfo" \
> >>         op monitor interval="30s"
> >> primitive drbd0 ocf:heartbeat:drbd \
> >>         params drbd_resource="drbd0" \
> >>         op monitor interval="29s" role="Master" timeout="30s" \
> >>         op monitor interval="30s" role="Slave" timeout="30s" \
> >>         op start interval="0" timeout="30s" start-delay="10s"
> >> primitive drbd1 ocf:heartbeat:drbd \
> >>         params drbd_resource="drbd1" \
> >>         op monitor interval="29s" role="Master" timeout="30s" \
> >>         op monitor interval="30s" role="Slave" timeout="30s" \
> >>         op start interval="0" timeout="30s" start-delay="10s"
> >> primitive drbd2 ocf:heartbeat:drbd \
> >>         params drbd_resource="drbd2" \
> >>         op monitor interval="29s" role="Master" timeout="30s" \
> >>         op monitor interval="30s" role="Slave" timeout="30s" \
> >>         op start interval="0" timeout="30s" start-delay="10s"
> >> primitive R_pingd ocf:pacemaker:pingd
> >> primitive R_Failover_Alert_Init ocf:heartbeat:MailTo2 \
> >>         params sender="[email protected]" \
> >>         params email="[email protected],[email protected]" \
> >>         params subject="ACS Init"
> >> primitive R_Failover_Alert_Target ocf:heartbeat:MailTo2 \
> >>         params sender="[email protected]" \
> >>         params email="[email protected],[email protected]" \
> >>         params subject="ACS San"
> >> group G_Target R_IP_Target R_tgtd \
> >>         meta target-role="Started"
> >> group G_Init R_IP_Init R_iscsi R_LVM R_Filesystem R_NFS \
> >>         meta target-role="Stopped"
> >> ms ms-drbd0 drbd0 \
> >>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> >> ms ms-drbd1 drbd1 \
> >>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> >> ms ms-drbd2 drbd2 \
> >>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> >> clone pingd R_pingd \
> >>         meta target-role="Started"
> >> clone Failover_Alert_Init R_Failover_Alert_Init \
> >>         meta clone-max="2" target-role="Stopped"
> >> clone Failover_Alert_Target R_Failover_Alert_Target \
> >>         meta clone-max="2" target-role="Stopped"
> >> location pingd-node-1 pingd 500: init1.mydomain.com
> >> location pingd-node-2 pingd 500: init2.mydomain.com
> >> location pingd-node-3 pingd 500: san1.mydomain.com
> >> location pingd-node-4 pingd 500: san2.mydomain.com
> >> location ms-drbd0-pref-1 ms-drbd0 200: san1.mydomain.com
> >> location ms-drbd0-pref-2 ms-drbd0 100: san2.mydomain.com
> >> location ms-drbd1-pref-1 ms-drbd1 200: san1.mydomain.com
> >> location ms-drbd1-pref-2 ms-drbd1 100: san2.mydomain.com
> >> location ms-drbd2-pref-1 ms-drbd2 200: san1.mydomain.com
> >> location ms-drbd2-pref-2 ms-drbd2 100: san2.mydomain.com
> >> location G_Target-pref-1 G_Target 200: san1.mydomain.com
> >> location G_Target-pref-2 G_Target 100: san2.mydomain.com
> >> location G_Init-pref-1 G_Init 200: init1.mydomain.com
> >> location G_Init-pref-2 G_Init 100: init2.mydomain.com
> >> location Failover-Alert-node1 Failover_Alert_Init 200: init1.mydomain.com
> >> location Failover-Alert-node2 Failover_Alert_Init 100: init2.mydomain.com
> >> location Failover-Alert-node3 Failover_Alert_Target 200: san1.mydomain.com
> >> location Failover-Alert-node4 Failover_Alert_Target 100: san2.mydomain.com
> >> colocation G_Target-on-ms-drbd0 inf: G_Target ms-drbd0:Master
> >> colocation G_Target-on-ms-drbd1 inf: G_Target ms-drbd1:Master
> >> colocation G_Target-on-ms-drbd2 inf: G_Target ms-drbd2:Master
> >> order ms-drbd0-before-ms-drbd1 inf: ms-drbd0:promote ms-drbd1:promote
> >> order ms-drbd1-before-ms-drbd2 inf: ms-drbd1:promote ms-drbd2:promote
> >> order ms-drbd2-before-G_Target inf: ms-drbd2:promote G_Target:start
> >> order G_Target-before-G_Init inf: G_Target:start G_Init:start
> >> property $id="cib-bootstrap-options" \
> >>         dc-version="1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9" \
> >>         stonith-enabled="false" \
> >>         stonith-action="reboot" \
> >>         stop-orphan-resources="true" \
> >>         stop-orphan-actions="true" \
> >>         symmetric-cluster="false" \
> >>         last-lrm-refresh="1239899583" \
> >>         default-resource-stickiness="INFINITY"
> >> 
> >> Any ideas?
> >> -- 
> >> View this message in context:
> >> http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23082432.html
> >> Sent from the Linux-HA mailing list archive at Nabble.com.
> >> 
> >> _______________________________________________
> >> Linux-HA mailing list
> >> [email protected]
> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >> See also: http://linux-ha.org/ReportingProblems
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23085508.html
> Sent from the Linux-HA mailing list archive at Nabble.com.
> 
