Re: [Linux-HA] DRBD does not switch resources to other node properly

Ethan Bannister Thu, 16 Apr 2009 12:25:26 -0700

I am taking a look at /var/log/messages to see what may be going on.  This
is what caught my eye on san2:


Apr 16 14:04:39 san2 lrmd: [12984]: info: rsc:drbd1:1: promote
Apr 16 14:04:40 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd0:1_monitor_29000 (call=107, rc=8, cib-update=133, confirmed=false) complete master
Apr 16 14:04:41 san2 lrmd: [12984]: info: RA output: (drbd1:1:promote:stdout) /dev/drbd1: State change failed: (-1) Multiple primaries not allowed by config Command 'drbdsetup /dev/drbd1 primary' terminated with exit code 11
Apr 16 14:04:41 san2 drbd[6372]: [6459]: ERROR: drbd1 promote: Not primary despite drbdadm call.
Apr 16 14:04:41 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_promote_0 (call=108, rc=1, cib-update=134, confirmed=true) complete unknown error
Apr 16 14:04:41 san2 kernel: drbd1: peer( Primary -> Secondary )
Apr 16 14:04:42 san2 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Apr 16 14:04:42 san2 kernel: drbd1: Writing meta data super block now.
Apr 16 14:04:42 san2 kernel: drbd1: asender terminated
Apr 16 14:04:42 san2 kernel: drbd1: Terminating asender thread
Apr 16 14:04:42 san2 kernel: drbd1: tl_clear()
Apr 16 14:04:42 san2 kernel: drbd1: Connection closed
Apr 16 14:04:42 san2 kernel: drbd1: conn( TearDown -> Unconnected )
Apr 16 14:04:42 san2 kernel: drbd1: receiver terminated
Apr 16 14:04:42 san2 kernel: drbd1: Restarting receiver thread
Apr 16 14:04:42 san2 kernel: drbd1: receiver (re)started
Apr 16 14:04:42 san2 kernel: drbd1: conn( Unconnected -> WFConnection )
Apr 16 14:04:42 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=174:43:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
Apr 16 14:04:42 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
Apr 16 14:04:42 san2 crm_master: [6492]: info: Invoked: /usr/sbin/crm_master -l reboot -v 10
Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd1:1
Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_perform_update: Sent update 118: master-drbd1:1=10
Apr 16 14:04:42 san2 lrmd: [12984]: info: RA output: (drbd1:1:notify:stdout) 0 Trying master-drbd1:1=10 update via attrd
Apr 16 14:04:42 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_notify_0 (call=109, rc=0, cib-update=135, confirmed=true) complete ok
Apr 16 14:04:43 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=170:44:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
Apr 16 14:04:43 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
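
To make the ordering of the DRBD state changes easier to follow, the kernel
lines can be pulled out with a small script.  This is only a rough sketch: it
assumes one log entry per line (as in the syslog file itself, not the wrapped
mail), and the function name drbd_transitions and its device argument are
made up for illustration.

```python
import re

# Matches kernel DRBD transitions such as "peer( Primary -> Secondary )".
# Entries like "tl_clear()" carry no "old -> new" pair and are not matched.
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def drbd_transitions(lines, device="drbd1"):
    """Return (timestamp, field, old, new) for each state change of `device`.

    Hypothetical helper: assumes the syslog layout shown in the excerpt,
    i.e. "Mon DD HH:MM:SS host kernel: drbdN: ...".
    """
    marker = "kernel: %s:" % device
    changes = []
    for line in lines:
        if marker not in line:
            continue
        timestamp = " ".join(line.split()[:3])
        for field, old, new in TRANSITION.findall(line):
            changes.append((timestamp, field, old, new))
    return changes


if __name__ == "__main__":
    with open("/var/log/messages") as log:
        for change in drbd_transitions(log):
            print("%s  %s: %s -> %s" % change)
```

Run against the excerpt above, this would list the peer( Primary ->
Secondary ) change logged at 14:04:41, followed by the connection teardown
at 14:04:42.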

I take it that drbd1 is not being demoted on san1 for some reason.  The
drbdsetup command exits with code 11 and states that multiple primaries are
not allowed, which is true.  What I can't get past is why this only happens
to drbd1 and not to drbd0 or drbd2.  I just upgraded to CentOS 5.3, and I am
using the most up-to-date versions of Pacemaker and Heartbeat.  I am also
using the drbd RA that came with the heartbeat package.  Is there another
log that may give me more insight?
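
For narrowing down which operations actually failed, the crmd
process_lrm_event lines can be filtered by return code.  The following is a
minimal sketch, again assuming one log entry per line in the format shown
above.  The name failed_operations is hypothetical, and treating rc=8 as
expected is a simplification: 8 is the OCF "running master" code, which is
normal for a monitor on a promoted instance but not for other operations.

```python
import re

# Matches crmd lines such as:
#   ... process_lrm_event: LRM operation drbd1:1_promote_0
#   (call=108, rc=1, cib-update=134, confirmed=true) complete unknown error
LRM_EVENT = re.compile(
    r"process_lrm_event: LRM operation (\S+) "
    r"\(call=(\d+), rc=(\d+), .*?\) complete (.*)"
)

# Assumption: 0 (success) and 8 (running master) are not failures.
EXPECTED_RC = {0, 8}


def failed_operations(lines):
    """Return (operation, call_id, rc, status_text) for unexpected results."""
    failures = []
    for line in lines:
        match = LRM_EVENT.search(line)
        if match is None:
            continue
        rc = int(match.group(3))
        if rc not in EXPECTED_RC:
            failures.append(
                (match.group(1), int(match.group(2)), rc, match.group(4).strip())
            )
    return failures


if __name__ == "__main__":
    with open("/var/log/messages") as log:
        for op, call_id, rc, text in failed_operations(log):
            print("%s (call=%d) rc=%d: %s" % (op, call_id, rc, text))
```

On the excerpt above this should report only drbd1:1_promote_0 (call=108,
rc=1), since the drbd0 monitor returned 8 and the notify returned 0.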


Dejan Muhamedagic wrote:
> 
> Hi,
> 
> On Thu, Apr 16, 2009 at 10:11:26AM -0700, Ethan Bannister wrote:
>> 
>> Perhaps someone may be able to give me a little insight into what I may
>> be doing wrong.  I would like DRBD to promote on the secondary machine
>> when the Ethernet connection to the initiator on my SAN goes down.  When
>> I pull the cable or bring down eth0, which the IPaddr resource resides
>> on, this is what crm_mon shows me soon after:
>> 
>> ============
>> Last updated: Thu Apr 16 12:38:36 2009
>> Current DC: init2.mydomain.com (1d3814dc-7928-4beb-99f6-c7ade09056a5) - partition with quorum
>> Version: 1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9
>> 4 Nodes configured, unknown expected votes
>> 8 Resources configured.
>> ============
>> 
>> Online: [ san2.mydomain.com init2.mydomain.com init1.mydomain.com ]
>> OFFLINE: [ san1.mydomain.com ]
>> 
>> Resource Group: G_Target
>>     R_IP_Target (ocf::heartbeat:IPaddr2):    Started san2.mydomain.com
>>     R_tgtd   (ocf::acs:tgtdra):      Started san2.mydomain.com
>> Master/Slave Set: ms-drbd0
>>         Masters: [ san2.mydomain.com ]
>>         Stopped: [ drbd0:0 ]     <---------- correct
>> Master/Slave Set: ms-drbd1
>>         Masters: [ san2.mydomain.com ]
>>         Stopped: [ drbd1:1 ]     <---------- incorrect
>> Master/Slave Set: ms-drbd2
>>         Masters: [ san2.mydomain.com ]
>>         Stopped: [ drbd2:0 ]     <---------- correct
>> Clone Set: pingd
>>         Started: [ init1.mydomain.com init2.mydomain.com san2.mydomain.com ]
>>         Stopped: [ R_pingd:2 ]
>> 
>> Failed actions:
>>     drbd1:1_promote_0 (node=san2.mydomain.com, call=43, rc=1, status=complete): unknown error
> 
> Does drbd report any error in the logs (look for lrmd.*drbd)?
> This looks like a resource or a drbd RA issue.
> 
> Thanks,
> 
> Dejan
> 
>> As you can see, drbd0 and drbd2 promote with no issues, but drbd1 is not
>> promoting properly.  I have checked my constraints and tweaked the
>> start-delay settings, but the failover still does not happen the way I
>> would like.  I have two initiators for redundancy as well, and I want
>> the initiator to stay up if the network goes down on either target.
>> This has been puzzling me for some time now.  Any help would be greatly
>> appreciated.
> 
>> Here is what I have for a crm cli config:
>> 
>> node $id="cee46f54-d517-4e4d-b0b8-3076fbc5751b" san2.mydomain.com \
>>         attributes standby="off"
>> node $id="bde24914-1235-4dc4-8686-f05fd9e6a35e" san1.mydomain.com \
>>         attributes standby="off"
>> node $id="1d3814dc-7928-4beb-99f6-c7ade09056a5" init2.mydomain.com \
>>         attributes standby="off"
>> node $id="a058cd72-b27e-4593-ac7e-d79db0709c15" init1.mydomain.com \
>>         attributes standby="off"
>> primitive R_IP_Target ocf:heartbeat:IPaddr2 \
>>         params ip="192.168.*.*" \
>>         params nic="eth0" \
>>         params iflabel="1" \
>>         op monitor interval="30s"
>> primitive R_tgtd ocf:acs:tgtdra \
>>         op monitor interval="30s" \
>>         op start interval="0" timeout="30s" start-delay="2s"
>> primitive R_IP_Init ocf:heartbeat:IPaddr2 \
>>         params ip="192.168.*.*" \
>>         params nic="eth0" \
>>         params iflabel="1" \
>>         op monitor interval="30s"
>> primitive R_iscsi ocf:heartbeat:iscsi \
>>         params target="target1.mydomain.com:san.targets" \
>>         params portal="192.168.*.*" \
>>         op monitor interval="30s" \
>>         op start interval="0" timeout="30s" start-delay="5s" \
>>         meta is-managed="true"
>> primitive R_LVM ocf:heartbeat:LVM \
>>         params volgrpname="VolGroup01" \
>>         op monitor interval="30s" \
>>         op start interval="0" timeout="30s" start-delay="5s" \
>>         meta is-managed="true"
>> primitive R_Filesystem ocf:heartbeat:Filesystem \
>>         params device="/dev/VolGroup01/LogVol00" \
>>         params directory="/san_targets/www" \
>>         params fstype="ext3" \
>>         op monitor interval="30s" \
>>         op start interval="0" timeout="30s" start-delay="5s"
>> primitive R_NFS ocf:heartbeat:nfsserver \
>>         params nfs_init_script="/etc/init.d/nfs" \
>>         params nfs_notify_cmd="/sbin/rpc.statd" \
>>         params nfs_shared_infodir="/san_targets/www/nfsinfo" \
>>         op monitor interval="30s"
>> primitive drbd0 ocf:heartbeat:drbd \
>>         params drbd_resource="drbd0" \
>>         op monitor interval="29s" role="Master" timeout="30s" \
>>         op monitor interval="30s" role="Slave" timeout="30s" \
>>         op start interval="0" timeout="30s" start-delay="10s"
>> primitive drbd1 ocf:heartbeat:drbd \
>>         params drbd_resource="drbd1" \
>>         op monitor interval="29s" role="Master" timeout="30s" \
>>         op monitor interval="30s" role="Slave" timeout="30s" \
>>         op start interval="0" timeout="30s" start-delay="10s"
>> primitive drbd2 ocf:heartbeat:drbd \
>>         params drbd_resource="drbd2" \
>>         op monitor interval="29s" role="Master" timeout="30s" \
>>         op monitor interval="30s" role="Slave" timeout="30s" \
>>         op start interval="0" timeout="30s" start-delay="10s"
>> primitive R_pingd ocf:pacemaker:pingd
>> primitive R_Failover_Alert_Init ocf:heartbeat:MailTo2 \
>>         params sender="[email protected]" \
>>         params email="[email protected],[email protected]" \
>>         params subject="ACS Init"
>> primitive R_Failover_Alert_Target ocf:heartbeat:MailTo2 \
>>         params sender="[email protected]" \
>>         params email="[email protected],[email protected]" \
>>         params subject="ACS San"
>> group G_Target R_IP_Target R_tgtd \
>>         meta target-role="Started"
>> group G_Init R_IP_Init R_iscsi R_LVM R_Filesystem R_NFS \
>>         meta target-role="Stopped"
>> ms ms-drbd0 drbd0 \
>>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> ms ms-drbd1 drbd1 \
>>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> ms ms-drbd2 drbd2 \
>>         meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> clone pingd R_pingd \
>>         meta target-role="Started"
>> clone Failover_Alert_Init R_Failover_Alert_Init \
>>         meta clone-max="2" target-role="Stopped"
>> clone Failover_Alert_Target R_Failover_Alert_Target \
>>         meta clone-max="2" target-role="Stopped"
>> location pingd-node-1 pingd 500: init1.mydomain.com
>> location pingd-node-2 pingd 500: init2.mydomain.com
>> location pingd-node-3 pingd 500: san1.mydomain.com
>> location pingd-node-4 pingd 500: san2.mydomain.com
>> location ms-drbd0-pref-1 ms-drbd0 200: san1.mydomain.com
>> location ms-drbd0-pref-2 ms-drbd0 100: san2.mydomain.com
>> location ms-drbd1-pref-1 ms-drbd1 200: san1.mydomain.com
>> location ms-drbd1-pref-2 ms-drbd1 100: san2.mydomain.com
>> location ms-drbd2-pref-1 ms-drbd2 200: san1.mydomain.com
>> location ms-drbd2-pref-2 ms-drbd2 100: san2.mydomain.com
>> location G_Target-pref-1 G_Target 200: san1.mydomain.com
>> location G_Target-pref-2 G_Target 100: san2.mydomain.com
>> location G_Init-pref-1 G_Init 200: init1.mydomain.com
>> location G_Init-pref-2 G_Init 100: init2.mydomain.com
>> location Failover-Alert-node1 Failover_Alert_Init 200: init1.mydomain.com
>> location Failover-Alert-node2 Failover_Alert_Init 100: init2.mydomain.com
>> location Failover-Alert-node3 Failover_Alert_Target 200: san1.mydomain.com
>> location Failover-Alert-node4 Failover_Alert_Target 100: san2.mydomain.com
>> colocation G_Target-on-ms-drbd0 inf: G_Target ms-drbd0:Master
>> colocation G_Target-on-ms-drbd1 inf: G_Target ms-drbd1:Master
>> colocation G_Target-on-ms-drbd2 inf: G_Target ms-drbd2:Master
>> order ms-drbd0-before-ms-drbd1 inf: ms-drbd0:promote ms-drbd1:promote
>> order ms-drbd1-before-ms-drbd2 inf: ms-drbd1:promote ms-drbd2:promote
>> order ms-drbd2-before-G_Target inf: ms-drbd2:promote G_Target:start
>> order G_Target-before-G_Init inf: G_Target:start G_Init:start
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9" \
>>         stonith-enabled="false" \
>>         stonith-action="reboot" \
>>         stop-orphan-resources="true" \
>>         stop-orphan-actions="true" \
>>         symmetric-cluster="false" \
>>         last-lrm-refresh="1239899583" \
>>         default-resource-stickiness="INFINITY"
>> 
>> Any ideas?
>> -- 
>> View this message in context:
>> http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23082432.html
>> Sent from the Linux-HA mailing list archive at Nabble.com.
>> 
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems

-- 
View this message in context: 
http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23084716.html
Sent from the Linux-HA mailing list archive at Nabble.com.
