Re: [Linux-HA] DRBD does not switch resources to other node properly

Ethan Bannister Thu, 16 Apr 2009 13:16:25 -0700

/var/log/messages on san2 shows that drbd1:1 could not be promoted on san2
because san1 was still Primary for that device.  That makes sense.  But why
were the other drbd devices on san1 taken down without trouble, and not
drbd1?  Is there a log file that would give me a better idea of what is
going on?  I am assuming that when I pull the cable or take down eth0, the
rest of the cluster is unable to tell san1 to demote the drbd devices so
that san2 can then promote them.  But from what I gather from this log,
drbdadm does all of this.  So is it safe to assume that drbdadm
communicates through the direct link between the two targets and that this
is failing for drbd1 for some reason?  This is puzzling me.  I know I am
missing something that is right under my nose.


Apr 16 14:04:39 san2 lrmd: [12984]: info: rsc:drbd1:1: promote
Apr 16 14:04:40 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd0:1_monitor_29000 (call=107, rc=8, cib-update=133, confirmed=false) complete master
Apr 16 14:04:41 san2 lrmd: [12984]: info: RA output: (drbd1:1:promote:stdout) /dev/drbd1: State change failed: (-1) Multiple primaries not allowed by config Command 'drbdsetup /dev/drbd1 primary' terminated with exit code 11
Apr 16 14:04:41 san2 drbd[6372]: [6459]: ERROR: drbd1 promote: Not primary despite drbdadm call.
Apr 16 14:04:41 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_promote_0 (call=108, rc=1, cib-update=134, confirmed=true) complete unknown error
Apr 16 14:04:41 san2 kernel: drbd1: peer( Primary -> Secondary )
Apr 16 14:04:42 san2 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Apr 16 14:04:42 san2 kernel: drbd1: Writing meta data super block now.
Apr 16 14:04:42 san2 kernel: drbd1: asender terminated
Apr 16 14:04:42 san2 kernel: drbd1: Terminating asender thread
Apr 16 14:04:42 san2 kernel: drbd1: tl_clear()
Apr 16 14:04:42 san2 kernel: drbd1: Connection closed
Apr 16 14:04:42 san2 kernel: drbd1: conn( TearDown -> Unconnected )
Apr 16 14:04:42 san2 kernel: drbd1: receiver terminated
Apr 16 14:04:42 san2 kernel: drbd1: Restarting receiver thread
Apr 16 14:04:42 san2 kernel: drbd1: receiver (re)started
Apr 16 14:04:42 san2 kernel: drbd1: conn( Unconnected -> WFConnection )
Apr 16 14:04:42 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=174:43:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
Apr 16 14:04:42 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
Apr 16 14:04:42 san2 crm_master: [6492]: info: Invoked: /usr/sbin/crm_master -l reboot -v 10
Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd1:1
Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_perform_update: Sent update 118: master-drbd1:1=10
Apr 16 14:04:42 san2 lrmd: [12984]: info: RA output: (drbd1:1:notify:stdout) 0 Trying master-drbd1:1=10 update via attrd
Apr 16 14:04:42 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_notify_0 (call=109, rc=0, cib-update=135, confirmed=true) complete ok
Apr 16 14:04:43 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=170:44:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
Apr 16 14:04:43 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
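
For anyone following along, here is a rough sketch (Python, not part of the
cluster tooling) of how the log excerpt above could be filtered for the
lrmd.*drbd pattern Dejan asked about, and for LRM operations that did not
succeed.  The function names are made up for illustration.  Treating rc=0
and rc=8 as "not a failure" is an assumption read off the excerpt itself,
where rc=0 is reported as "ok", rc=8 as "master", and rc=1 as "unknown
error".  The sample lines are copied from the excerpt, one entry per line.

```python
import re

# Sample entries copied from the log excerpt, unwrapped to one entry per line.
LOG = "\n".join([
    "Apr 16 14:04:39 san2 lrmd: [12984]: info: rsc:drbd1:1: promote",
    "Apr 16 14:04:40 san2 crmd: [12987]: info: process_lrm_event: LRM operation "
    "drbd0:1_monitor_29000 (call=107, rc=8, cib-update=133, confirmed=false) "
    "complete master",
    "Apr 16 14:04:41 san2 lrmd: [12984]: info: RA output: "
    "(drbd1:1:promote:stdout) /dev/drbd1: State change failed: (-1) Multiple "
    "primaries not allowed by config Command 'drbdsetup /dev/drbd1 primary' "
    "terminated with exit code 11",
    "Apr 16 14:04:41 san2 crmd: [12987]: info: process_lrm_event: LRM operation "
    "drbd1:1_promote_0 (call=108, rc=1, cib-update=134, confirmed=true) "
    "complete unknown error",
    "Apr 16 14:04:42 san2 kernel: drbd1: Connection closed",
])

# The pattern suggested in the thread: lrmd entries that mention drbd.
LRMD_DRBD = re.compile(r"lrmd.*drbd")

# crmd summary lines of the form "LRM operation <name> (call=N, rc=N, ...".
LRM_OP = re.compile(r"LRM operation (\S+) \(call=\d+, rc=(\d+),")

# Assumption: rc=0 ("ok") and rc=8 ("master") are not failures.
NON_FAILURE_RCS = {0, 8}


def lrmd_drbd_lines(text):
    """Return the log lines that match lrmd.*drbd."""
    return [line for line in text.splitlines() if LRMD_DRBD.search(line)]


def failed_ops(text):
    """Return (operation, rc) pairs for LRM operations with a failure rc."""
    failures = []
    for line in text.splitlines():
        match = LRM_OP.search(line)
        if match and int(match.group(2)) not in NON_FAILURE_RCS:
            failures.append((match.group(1), int(match.group(2))))
    return failures


if __name__ == "__main__":
    for line in lrmd_drbd_lines(LOG):
        print(line)
    for op, rc in failed_ops(LOG):
        print(f"FAILED: {op} rc={rc}")
```

To run it against the real log instead of the sample, read the contents of
/var/log/messages into a string and pass that to the two functions.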



Dejan Muhamedagic wrote:
> 
> Hi,
> 
> On Thu, Apr 16, 2009 at 10:11:26AM -0700, Ethan Bannister wrote:
>> 
>> Perhaps someone can give me a little insight into what I may be doing
>> wrong.  I would like DRBD to promote on the secondary machine when the
>> Ethernet connection to the initiator on my SAN goes down.  When I pull
>> the cable or bring down eth0, which the IPaddr resource resides on, this
>> is what crm_mon shows me soon after:
>> 
>> ============
>> Last updated: Thu Apr 16 12:38:36 2009
>> Current DC: init2.mydomain.com (1d3814dc-7928-4beb-99f6-c7ade09056a5) - partition with quorum
>> Version: 1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9
>> 4 Nodes configured, unknown expected votes
>> 8 Resources configured.
>> ============
>> 
>> Online: [ san2.mydomain.com init2.mydomain.com init1.mydomain.com ]
>> OFFLINE: [ san1.mydomain.com ]
>> 
>> Resource Group: G_Target
>>     R_IP_Target (ocf::heartbeat:IPaddr2):    Started san2.mydomain.com
>>     R_tgtd   (ocf::acs:tgtdra):      Started san2.mydomain.com
>> Master/Slave Set: ms-drbd0
>>         Masters: [ san2.mydomain.com ]
>>         Stopped: [ drbd0:0 ]     <---------- correct
>> Master/Slave Set: ms-drbd1
>>         Masters: [ san2.mydomain.com ]
>>         Stopped: [ drbd1:1 ]     <---------- incorrect
>> Master/Slave Set: ms-drbd2
>>         Masters: [ san2.mydomain.com ]
>>         Stopped: [ drbd2:0 ]     <---------- correct
>> Clone Set: pingd
>>         Started: [ init1.mydomain.com init2.mydomain.com san2.mydomain.com ]
>>         Stopped: [ R_pingd:2 ]
>> 
>> Failed actions:
>>     drbd1:1_promote_0 (node=san2.mydomain.com, call=43, rc=1, status=complete): unknown error
> 
> Does drbd report any error in the logs (look for lrmd.*drbd)?
> This looks like a resource or a drbd RA issue.
> 
> Thanks,
> 
> Dejan
> 
>> As you can see, drbd0 and drbd2 promote with no issues.  But drbd1 is not
>> promoting properly.  I have checked my constraints and tweaked the
>> start-delay settings, but nothing behaves the way I would like.  I have
>> two initiators for redundancy as well, but I want the initiator to stay up
>> if the network goes down on either target.  This has been puzzling me for
>> some time now.  Any help would be greatly appreciated.
> 
>> Here is what I have for a crm cli config:
>> 
>> node $id="cee46f54-d517-4e4d-b0b8-3076fbc5751b" san2.mydomain.com \
>>         attributes standby="off"
>> node $id="bde24914-1235-4dc4-8686-f05fd9e6a35e" san1.mydomain.com \
>>         attributes standby="off"
>> node $id="1d3814dc-7928-4beb-99f6-c7ade09056a5" init2.mydomain.com \
>>         attributes standby="off"
>> node $id="a058cd72-b27e-4593-ac7e-d79db0709c15" init1.mydomain.com \
>>         attributes standby="off"
>> primitive R_IP_Target ocf:heartbeat:IPaddr2 \
>>         params ip="192.168.*.*" \
>>         params nic="eth0" \
>>         params iflabel="1" \
>>         op monitor interval="30s"
>> primitive R_tgtd ocf:acs:tgtdra \
>>         op monitor interval="30s" \
>>         op start interval="0" timeout="30s" start-delay="2s"
>> primitive R_IP_Init ocf:heartbeat:IPaddr2 \
>>         params ip="192.168.*.*" \
>>         params nic="eth0" \
>>         params iflabel="1" \
>>         op monitor interval="30s"
>> primitive R_iscsi ocf:heartbeat:iscsi \
>>         params target="target1.mydomain.com:san.targets" \
>>         params portal="192.168.*.*" \
>>         op monitor interval="30s" \
>>         op start interval="0" timeout="30s" start-delay="5s" \
>>         meta is-managed="true"
>> primitive R_LVM ocf:heartbeat:LVM \
>>         params volgrpname="VolGroup01" \
>>         op monitor interval="30s" \
>>         op start interval="0" timeout="30s" start-delay="5s" \
>>         meta is-managed="true"
>> primitive R_Filesystem ocf:heartbeat:Filesystem \
>>         params device="/dev/VolGroup01/LogVol00" \
>>         params directory="/san_targets/www" \
>>         params fstype="ext3" \
>>         op monitor interval="30s" \
>>         op start interval="0" timeout="30s" start-delay="5s"
>> primitive R_NFS ocf:heartbeat:nfsserver \
>>         params nfs_init_script="/etc/init.d/nfs" \
>>         params nfs_notify_cmd="/sbin/rpc.statd" \
>>         params nfs_shared_infodir="/san_targets/www/nfsinfo" \
>>         op monitor interval="30s"
>> primitive drbd0 ocf:heartbeat:drbd \
>>         params drbd_resource="drbd0" \
>>         op monitor interval="29s" role="Master" timeout="30s" \
>>         op monitor interval="30s" role="Slave" timeout="30s" \
>>         op start interval="0" timeout="30s" start-delay="10s"
>> primitive drbd1 ocf:heartbeat:drbd \
>>         params drbd_resource="drbd1" \
>>         op monitor interval="29s" role="Master" timeout="30s" \
>>         op monitor interval="30s" role="Slave" timeout="30s" \
>>         op start interval="0" timeout="30s" start-delay="10s"
>> primitive drbd2 ocf:heartbeat:drbd \
>>         params drbd_resource="drbd2" \
>>         op monitor interval="29s" role="Master" timeout="30s" \
>>         op monitor interval="30s" role="Slave" timeout="30s" \
>>         op start interval="0" timeout="30s" start-delay="10s"
>> primitive R_pingd ocf:pacemaker:pingd
>> primitive R_Failover_Alert_Init ocf:heartbeat:MailTo2 \
>>         params sender="[email protected]" \
>>         params email="[email protected],[email protected]" \
>>         params subject="ACS Init"
>> primitive R_Failover_Alert_Target ocf:heartbeat:MailTo2 \
>>         params sender="[email protected]" \
>>         params email="[email protected],[email protected]" \
>>         params subject="ACS San"
>> group G_Target R_IP_Target R_tgtd \
>>         meta target-role="Started"
>> group G_Init R_IP_Init R_iscsi R_LVM R_Filesystem R_NFS \
>>         meta target-role="Stopped"
>> ms ms-drbd0 drbd0 \
>>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> ms ms-drbd1 drbd1 \
>>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> ms ms-drbd2 drbd2 \
>>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> clone pingd R_pingd \
>>         meta target-role="Started"
>> clone Failover_Alert_Init R_Failover_Alert_Init \
>>         meta clone-max="2" target-role="Stopped"
>> clone Failover_Alert_Target R_Failover_Alert_Target \
>>         meta clone-max="2" target-role="Stopped"
>> location pingd-node-1 pingd 500: init1.mydomain.com
>> location pingd-node-2 pingd 500: init2.mydomain.com
>> location pingd-node-3 pingd 500: san1.mydomain.com
>> location pingd-node-4 pingd 500: san2.mydomain.com
>> location ms-drbd0-pref-1 ms-drbd0 200: san1.mydomain.com
>> location ms-drbd0-pref-2 ms-drbd0 100: san2.mydomain.com
>> location ms-drbd1-pref-1 ms-drbd1 200: san1.mydomain.com
>> location ms-drbd1-pref-2 ms-drbd1 100: san2.mydomain.com
>> location ms-drbd2-pref-1 ms-drbd2 200: san1.mydomain.com
>> location ms-drbd2-pref-2 ms-drbd2 100: san2.mydomain.com
>> location G_Target-pref-1 G_Target 200: san1.mydomain.com
>> location G_Target-pref-2 G_Target 100: san2.mydomain.com
>> location G_Init-pref-1 G_Init 200: init1.mydomain.com
>> location G_Init-pref-2 G_Init 100: init2.mydomain.com
>> location Failover-Alert-node1 Failover_Alert_Init 200: init1.mydomain.com
>> location Failover-Alert-node2 Failover_Alert_Init 100: init2.mydomain.com
>> location Failover-Alert-node3 Failover_Alert_Target 200: san1.mydomain.com
>> location Failover-Alert-node4 Failover_Alert_Target 100: san2.mydomain.com
>> colocation G_Target-on-ms-drbd0 inf: G_Target ms-drbd0:Master
>> colocation G_Target-on-ms-drbd1 inf: G_Target ms-drbd1:Master
>> colocation G_Target-on-ms-drbd2 inf: G_Target ms-drbd2:Master
>> order ms-drbd0-before-ms-drbd1 inf: ms-drbd0:promote ms-drbd1:promote
>> order ms-drbd1-before-ms-drbd2 inf: ms-drbd1:promote ms-drbd2:promote
>> order ms-drbd2-before-G_Target inf: ms-drbd2:promote G_Target:start
>> order G_Target-before-G_Init inf: G_Target:start G_Init:start
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9" \
>>         stonith-enabled="false" \
>>         stonith-action="reboot" \
>>         stop-orphan-resources="true" \
>>         stop-orphan-actions="true" \
>>         symmetric-cluster="false" \
>>         last-lrm-refresh="1239899583" \
>>         default-resource-stickiness="INFINITY"
>> 
>> Any ideas?
>> -- 
>> View this message in context:
>> http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23082432.html
>> Sent from the Linux-HA mailing list archive at Nabble.com.
>> 
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems

-- 
View this message in context: 
http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23085508.html
Sent from the Linux-HA mailing list archive at Nabble.com.

