Re: [Linux-HA] DRBD does not switch resources to other node properly

Dejan Muhamedagic Fri, 17 Apr 2009 04:50:07 -0700

Hi,

On Thu, Apr 16, 2009 at 12:24:59PM -0700, Ethan Bannister wrote:
> 
> I am attempting to take a look at /var/log/messages to see what may be going
> on...  This is something that caught my eye on san2:
> 
> Apr 16 14:04:39 san2 lrmd: [12984]: info: rsc:drbd1:1: promote
> Apr 16 14:04:40 san2 crmd: [12987]: info: process_lrm_event: LRM operation
> drbd0:1_monitor_29000 (call=107, rc=8, cib-update=133, confirmed=false)
> complete master
> Apr 16 14:04:41 san2 lrmd: [12984]: info: RA output:
> (drbd1:1:promote:stdout) /dev/drbd1: State change failed: (-1) Multiple
> primaries not allowed by config Command 'drbdsetup /dev/drbd1 primary'
> terminated with exit code 11
> Apr 16 14:04:41 san2 drbd[6372]: [6459]: ERROR: drbd1 promote: Not primary
> despite drbdadm call.
> Apr 16 14:04:41 san2 crmd: [12987]: info: process_lrm_event: LRM operation
> drbd1:1_promote_0 (call=108, rc=1, cib-update=134, confirmed=true) complete
> unknown error
> Apr 16 14:04:41 san2 kernel: drbd1: peer( Primary -> Secondary )
> Apr 16 14:04:42 san2 kernel: drbd1: peer( Secondary -> Unknown ) conn(
> Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
> Apr 16 14:04:42 san2 kernel: drbd1: Writing meta data super block now.
> Apr 16 14:04:42 san2 kernel: drbd1: asender terminated
> Apr 16 14:04:42 san2 kernel: drbd1: Terminating asender thread
> Apr 16 14:04:42 san2 kernel: drbd1: tl_clear()
> Apr 16 14:04:42 san2 kernel: drbd1: Connection closed
> Apr 16 14:04:42 san2 kernel: drbd1: conn( TearDown -> Unconnected )
> Apr 16 14:04:42 san2 kernel: drbd1: receiver terminated
> Apr 16 14:04:42 san2 kernel: drbd1: Restarting receiver thread
> Apr 16 14:04:42 san2 kernel: drbd1: receiver (re)started
> Apr 16 14:04:42 san2 kernel: drbd1: conn( Unconnected -> WFConnection )
> Apr 16 14:04:42 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing
> key=174:43:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
> Apr 16 14:04:42 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
> Apr 16 14:04:42 san2 crm_master: [6492]: info: Invoked: /usr/sbin/crm_master
> -l reboot -v 10
> Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_trigger_update: Sending
> flush op to all hosts for: master-drbd1:1
> Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_perform_update: Sent update
> 118: master-drbd1:1=10
> Apr 16 14:04:42 san2 lrmd: [12984]: info: RA output: (drbd1:1:notify:stdout)
> 0 Trying master-drbd1:1=10 update via attrd
> Apr 16 14:04:42 san2 crmd: [12987]: info: process_lrm_event: LRM operation
> drbd1:1_notify_0 (call=109, rc=0, cib-update=135, confirmed=true) complete
> ok
> Apr 16 14:04:43 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing
> key=170:44:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
> Apr 16 14:04:43 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
> 
> I take it that it is not demoting on san1 for some odd reason...


Is there anything in the san1 logs about that?
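
One way to look is to filter the san1 syslog for lrmd, drbd RA and kernel
lines that mention the resource. This is only a rough sketch: the function
name is made up for illustration, and the pattern assumes the daemon tags
look the same as in the san2 excerpt above.

```python
import re

def drbd_events(lines, resource="drbd1"):
    """Return syslog lines from lrmd, the drbd RA, or the kernel that mention `resource`."""
    # Assumed tag formats, taken from the san2 excerpt: "lrmd:", "drbd[PID]:", "kernel:"
    pattern = re.compile(
        r"\b(lrmd|drbd\[\d+\]|kernel):.*\b" + re.escape(resource) + r"\b"
    )
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

# Possible usage on san1 (log path as mentioned in this thread):
#   with open("/var/log/messages") as log:
#       for event in drbd_events(log, "drbd1"):
#           print(event)
```

A demote attempt on san1 around 14:04:39 to 14:04:41 should show up in that
output if one was made.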

> It exits with code 11 and states that dual primaries are not allowed,
> which is true.  But the thing that I can't get past is why it is only doing
> this to drbd1 and not drbd0 or drbd2.

What do the logs say about these two? If they should fail to
promote, but the RA still reports success, then the RA has to be
fixed. What does drbdadm state (or was it status?) say?
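
To compare the three resources, one option is to pull the operation results
out of the crmd lines. A minimal sketch, assuming the process_lrm_event
format seen in the san2 excerpt, with each event on a single line in the
real log; LRM_EVENT and lrm_results are hypothetical names. By the OCF
convention rc=0 is success, rc=1 is a generic error and rc=8 means running
as master.

```python
import re

# Matches crmd lines of the form (one line in the real log):
#   ... process_lrm_event: LRM operation drbd1:1_promote_0
#   (call=108, rc=1, cib-update=134, confirmed=true) complete unknown error
LRM_EVENT = re.compile(
    r"process_lrm_event: LRM operation "
    r"(?P<rsc>\S+?)_(?P<op>[a-z]+)_(?P<interval>\d+) "
    r"\(call=(?P<call>\d+), rc=(?P<rc>\d+),.*?\) complete (?P<status>.*)$"
)

def lrm_results(lines, op="promote"):
    """Return (resource instance, rc, status text) for each completed `op`."""
    results = []
    for line in lines:
        match = LRM_EVENT.search(line)
        if match and match.group("op") == op:
            results.append(
                (match.group("rsc"), int(match.group("rc")), match.group("status").strip())
            )
    return results
```

If drbd0 and drbd2 report rc=0 for promote while drbdadm shows them as
Secondary, that would point at the RA.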

> I just upgraded to the new CentOS
> 5.3 and I am using the most up-to-date versions of pacemaker and heartbeat.
> I am also using the RA for drbd that came with the heartbeat package.  Is
> there another log that may give me more insight?

No, I don't think so.

Thanks,

Dejan

> 
> Dejan Muhamedagic wrote:
> > 
> > Hi,
> > 
> > On Thu, Apr 16, 2009 at 10:11:26AM -0700, Ethan Bannister wrote:
> >> 
> >> Perhaps someone may be able to give me a little insight on what I may be
> >> doing wrong.  I would like to have DRBD promote on the secondary machine
> >> when the Ethernet connection to the initiator on my SAN goes down.  When I
> >> pull the cable or bring down eth0, which IPaddr resides on, this is what
> >> crm_mon shows me soon after:
> >> 
> >> ============
> >> Last updated: Thu Apr 16 12:38:36 2009
> >> Current DC: init2.mydomain.com (1d3814dc-7928-4beb-99f6-c7ade09056a5) -
> >> partition with quorum
> >> Version: 1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9
> >> 4 Nodes configured, unknown expected votes
> >> 8 Resources configured.
> >> ============
> >> 
> >> Online: [ san2.mydomain.com init2.mydomain.com init1.mydomain.com ]
> >> OFFLINE: [ san1.mydomain.com ]
> >> 
> >> Resource Group: G_Target
> >>     R_IP_Target (ocf::heartbeat:IPaddr2):  Started san2.mydomain.com
> >>     R_tgtd (ocf::acs:tgtdra):      Started san2.mydomain.com
> >> Master/Slave Set: ms-drbd0
> >>         Masters: [ san2.mydomain.com ]
> >>         Stopped: [ drbd0:0 ]     <---------- correct
> >> Master/Slave Set: ms-drbd1
> >>         Masters: [ san2.mydomain.com ]
> >>         Stopped: [ drbd1:1 ]     <---------- incorrect
> >> Master/Slave Set: ms-drbd2
> >>         Masters: [ san2.mydomain.com ]
> >>         Stopped: [ drbd2:0 ]     <---------- correct
> >> Clone Set: pingd
> >>         Started: [ init1.mydomain.com init2.mydomain.com
> >> san2.mydomain.com ]
> >>         Stopped: [ R_pingd:2 ]
> >> 
> >> Failed actions:
> >>     drbd1:1_promote_0 (node=san2.mydomain.com, call=43, rc=1,
> >> status=complete): unknown error
> > 
> > Does drbd report any error in the logs (look for lrmd.*drbd)?
> > This looks like a resource or a drbd RA issue.
> > 
> > Thanks,
> > 
> > Dejan
> > 
> >> As you can see, drbd0 and drbd2 promote with no issues.  But drbd1 is not
> >> promoting properly.  I have checked my constraints, and I have tweaked
> >> the start-delay settings, but nothing happens the way I would like.  I
> >> have two initiators for redundancy as well.  But I want the initiator to
> >> stay up if the network goes down on either target.  This has been puzzling
> >> me for some time now.  Any help would be greatly appreciated.
> > 
> >> Here is what I have for a crm cli config:
> >> 
> >> node $id="cee46f54-d517-4e4d-b0b8-3076fbc5751b" san2.mydomain.com \
> >>         attributes standby="off"
> >> node $id="bde24914-1235-4dc4-8686-f05fd9e6a35e" san1.mydomain.com \
> >>         attributes standby="off"
> >> node $id="1d3814dc-7928-4beb-99f6-c7ade09056a5" init2.mydomain.com \
> >>         attributes standby="off"
> >> node $id="a058cd72-b27e-4593-ac7e-d79db0709c15" init1.mydomain.com \
> >>         attributes standby="off"
> >> primitive R_IP_Target ocf:heartbeat:IPaddr2 \
> >>         params ip="192.168.*.*" \
> >>         params nic="eth0" \
> >>         params iflabel="1" \
> >>         op monitor interval="30s"
> >> primitive R_tgtd ocf:acs:tgtdra \
> >>         op monitor interval="30s" \
> >>         op start interval="0" timeout="30s" start-delay="2s"
> >> primitive R_IP_Init ocf:heartbeat:IPaddr2 \
> >>         params ip="192.168.*.*" \
> >>         params nic="eth0" \
> >>         params iflabel="1" \
> >>         op monitor interval="30s"
> >> primitive R_iscsi ocf:heartbeat:iscsi \
> >>         params target="target1.mydomain.com:san.targets" \
> >>         params portal="192.168.*.*" \
> >>         op monitor interval="30s" \
> >>         op start interval="0" timeout="30s" start-delay="5s" \
> >>         meta is-managed="true"
> >> primitive R_LVM ocf:heartbeat:LVM \
> >>         params volgrpname="VolGroup01" \
> >>         op monitor interval="30s" \
> >>         op start interval="0" timeout="30s" start-delay="5s" \
> >>         meta is-managed="true"
> >> primitive R_Filesystem ocf:heartbeat:Filesystem \
> >>         params device="/dev/VolGroup01/LogVol00" \
> >>         params directory="/san_targets/www" \
> >>         params fstype="ext3" \
> >>         op monitor interval="30s" \
> >>         op start interval="0" timeout="30s" start-delay="5s"
> >> primitive R_NFS ocf:heartbeat:nfsserver \
> >>         params nfs_init_script="/etc/init.d/nfs" \
> >>         params nfs_notify_cmd="/sbin/rpc.statd" \
> >>         params nfs_shared_infodir="/san_targets/www/nfsinfo" \
> >>         op monitor interval="30s"
> >> primitive drbd0 ocf:heartbeat:drbd \
> >>         params drbd_resource="drbd0" \
> >>         op monitor interval="29s" role="Master" timeout="30s" \
> >>         op monitor interval="30s" role="Slave" timeout="30s" \
> >>         op start interval="0" timeout="30s" start-delay="10s"
> >> primitive drbd1 ocf:heartbeat:drbd \
> >>         params drbd_resource="drbd1" \
> >>         op monitor interval="29s" role="Master" timeout="30s" \
> >>         op monitor interval="30s" role="Slave" timeout="30s" \
> >>         op start interval="0" timeout="30s" start-delay="10s"
> >> primitive drbd2 ocf:heartbeat:drbd \
> >>         params drbd_resource="drbd2" \
> >>         op monitor interval="29s" role="Master" timeout="30s" \
> >>         op monitor interval="30s" role="Slave" timeout="30s" \
> >>         op start interval="0" timeout="30s" start-delay="10s"
> >> primitive R_pingd ocf:pacemaker:pingd
> >> primitive R_Failover_Alert_Init ocf:heartbeat:MailTo2 \
> >>         params sender="[email protected]" \
> >>         params email="[email protected],[email protected]" \
> >>         params subject="ACS Init"
> >> primitive R_Failover_Alert_Target ocf:heartbeat:MailTo2 \
> >>         params sender="[email protected]" \
> >>         params email="[email protected],[email protected]" \
> >>         params subject="ACS San"
> >> group G_Target R_IP_Target R_tgtd \
> >>         meta target-role="Started"
> >> group G_Init R_IP_Init R_iscsi R_LVM R_Filesystem R_NFS \
> >>         meta target-role="Stopped"
> >> ms ms-drbd0 drbd0 \
> >>         meta clone-max="2" notify="true" globally-unique="false" \
> >>         target-role="Started"
> >> ms ms-drbd1 drbd1 \
> >>         meta clone-max="2" notify="true" globally-unique="false" \
> >>         target-role="Started"
> >> ms ms-drbd2 drbd2 \
> >>         meta clone-max="2" notify="true" globally-unique="false" \
> >>         target-role="Started"
> >> clone pingd R_pingd \
> >>         meta target-role="Started"
> >> clone Failover_Alert_Init R_Failover_Alert_Init \
> >>         meta clone-max="2" target-role="Stopped"
> >> clone Failover_Alert_Target R_Failover_Alert_Target \
> >>         meta clone-max="2" target-role="Stopped"
> >> location pingd-node-1 pingd 500: init1.mydomain.com
> >> location pingd-node-2 pingd 500: init2.mydomain.com
> >> location pingd-node-3 pingd 500: san1.mydomain.com
> >> location pingd-node-4 pingd 500: san2.mydomain.com
> >> location ms-drbd0-pref-1 ms-drbd0 200: san1.mydomain.com
> >> location ms-drbd0-pref-2 ms-drbd0 100: san2.mydomain.com
> >> location ms-drbd1-pref-1 ms-drbd1 200: san1.mydomain.com
> >> location ms-drbd1-pref-2 ms-drbd1 100: san2.mydomain.com
> >> location ms-drbd2-pref-1 ms-drbd2 200: san1.mydomain.com
> >> location ms-drbd2-pref-2 ms-drbd2 100: san2.mydomain.com
> >> location G_Target-pref-1 G_Target 200: san1.mydomain.com
> >> location G_Target-pref-2 G_Target 100: san2.mydomain.com
> >> location G_Init-pref-1 G_Init 200: init1.mydomain.com
> >> location G_Init-pref-2 G_Init 100: init2.mydomain.com
> >> location Failover-Alert-node1 Failover_Alert_Init 200: init1.mydomain.com
> >> location Failover-Alert-node2 Failover_Alert_Init 100: init2.mydomain.com
> >> location Failover-Alert-node3 Failover_Alert_Target 200: san1.mydomain.com
> >> location Failover-Alert-node4 Failover_Alert_Target 100: san2.mydomain.com
> >> colocation G_Target-on-ms-drbd0 inf: G_Target ms-drbd0:Master
> >> colocation G_Target-on-ms-drbd1 inf: G_Target ms-drbd1:Master
> >> colocation G_Target-on-ms-drbd2 inf: G_Target ms-drbd2:Master
> >> order ms-drbd0-before-ms-drbd1 inf: ms-drbd0:promote ms-drbd1:promote
> >> order ms-drbd1-before-ms-drbd2 inf: ms-drbd1:promote ms-drbd2:promote
> >> order ms-drbd2-before-G_Target inf: ms-drbd2:promote G_Target:start
> >> order G_Target-before-G_Init inf: G_Target:start G_Init:start
> >> property $id="cib-bootstrap-options" \
> >>         dc-version="1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9" \
> >>         stonith-enabled="false" \
> >>         stonith-action="reboot" \
> >>         stop-orphan-resources="true" \
> >>         stop-orphan-actions="true" \
> >>         symmetric-cluster="false" \
> >>         last-lrm-refresh="1239899583" \
> >>         default-resource-stickiness="INFINITY"
> >> 
> >> Any ideas?
> >> -- 
> >> View this message in context:
> >> http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23082432.html
> >> Sent from the Linux-HA mailing list archive at Nabble.com.
> >> 
> >> _______________________________________________
> >> Linux-HA mailing list
> >> [email protected]
> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >> See also: http://linux-ha.org/ReportingProblems
> > 
> > 
> 
