/var/log/messages on san2 says that it could not promote drbd1:1 on san2 because san1 was still in Primary mode, which makes sense. But why would it have no trouble taking down the other DRBD devices on san1 and fail only for drbd1? Is there a log file that would give me a better idea of what is going on?

I am assuming that when I pull the cable or take down eth0, the rest of the cluster is unable to tell san1 to demote the DRBD devices so that san2 can then promote them. But from what I gather from this log file, drbdadm does all of this. So would it be safe to assume that drbdadm communicates over the direct link between the two targets, and that this is failing for drbd1 for some reason? This is puzzling me. I know that I am missing something that is right under my nose.
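
To line up the promote attempt against the kernel's view of the peer, here is a minimal sketch of a log filter. It is my own helper, not part of DRBD or Pacemaker: the name `drbd_events` and the parsing rules are assumptions based only on the syslog format in the excerpt that follows (15-character timestamp, kernel transitions written as `field( Old -> New )`).

```python
import re

# Kernel-side DRBD transitions look like "peer( Primary -> Secondary )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def drbd_events(lines, resource):
    """Collect state changes and errors for one DRBD resource, in log order.

    Hypothetical helper: returns (timestamp, kind, detail) tuples.
    """
    # Word boundaries keep "drbd1" from also matching "drbd10".
    name = re.compile(r"\b" + re.escape(resource) + r"\b")
    events = []
    for line in lines:
        if not name.search(line):
            continue
        # Assumes the classic syslog layout: "Apr 16 14:04:41 host tag: msg".
        stamp, rest = line[:15], line[16:].strip()
        changes = TRANSITION.findall(line)
        if changes:
            for field, old, new in changes:
                events.append((stamp, "state", f"{field}: {old} -> {new}"))
        elif "ERROR" in line or "State change failed" in line:
            events.append((stamp, "error", rest))
    return events
```

Feeding it the lines of /var/log/messages with "drbd1" as the resource gives a short timeline, which should make it easier to see whether the failed promote was logged before or after the peer's Primary -> Secondary change.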

Apr 16 14:04:39 san2 lrmd: [12984]: info: rsc:drbd1:1: promote
Apr 16 14:04:40 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd0:1_monitor_29000 (call=107, rc=8, cib-update=133, confirmed=false) complete master
Apr 16 14:04:41 san2 lrmd: [12984]: info: RA output: (drbd1:1:promote:stdout) /dev/drbd1: State change failed: (-1) Multiple primaries not allowed by config
Command 'drbdsetup /dev/drbd1 primary' terminated with exit code 11
Apr 16 14:04:41 san2 drbd[6372]: [6459]: ERROR: drbd1 promote: Not primary despite drbdadm call.
Apr 16 14:04:41 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_promote_0 (call=108, rc=1, cib-update=134, confirmed=true) complete unknown error
Apr 16 14:04:41 san2 kernel: drbd1: peer( Primary -> Secondary )
Apr 16 14:04:42 san2 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Apr 16 14:04:42 san2 kernel: drbd1: Writing meta data super block now.
Apr 16 14:04:42 san2 kernel: drbd1: asender terminated
Apr 16 14:04:42 san2 kernel: drbd1: Terminating asender thread
Apr 16 14:04:42 san2 kernel: drbd1: tl_clear()
Apr 16 14:04:42 san2 kernel: drbd1: Connection closed
Apr 16 14:04:42 san2 kernel: drbd1: conn( TearDown -> Unconnected )
Apr 16 14:04:42 san2 kernel: drbd1: receiver terminated
Apr 16 14:04:42 san2 kernel: drbd1: Restarting receiver thread
Apr 16 14:04:42 san2 kernel: drbd1: receiver (re)started
Apr 16 14:04:42 san2 kernel: drbd1: conn( Unconnected -> WFConnection )
Apr 16 14:04:42 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=174:43:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
Apr 16 14:04:42 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
Apr 16 14:04:42 san2 crm_master: [6492]: info: Invoked: /usr/sbin/crm_master -l reboot -v 10
Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd1:1
Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_perform_update: Sent update 118: master-drbd1:1=10
Apr 16 14:04:42 san2 lrmd: [12984]: info: RA output: (drbd1:1:notify:stdout) 0 Trying master-drbd1:1=10 update via attrd
Apr 16 14:04:42 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_notify_0 (call=109, rc=0, cib-update=135, confirmed=true) complete ok
Apr 16 14:04:43 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=170:44:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
Apr 16 14:04:43 san2 lrmd: [12984]: info: rsc:drbd1:1: notify

Dejan Muhamedagic wrote:
>
> Hi,
>
> On Thu, Apr 16, 2009 at 10:11:26AM -0700, Ethan Bannister wrote:
>>
>> Perhaps someone may be able to give me a little insight on what I may be
>> doing wrong. I would like to have DRBD promote on secondary machine when
>> the Ethernet connection to the initiator on my SAN goes down. When I pull
>> the cable or bring eth0 down which IPaddr resides on, this is what crm_mon
>> shows me soon after:
>>
>> ============
>> Last updated: Thu Apr 16 12:38:36 2009
>> Current DC: init2.mydomain.com (1d3814dc-7928-4beb-99f6-c7ade09056a5) - partition with quorum
>> Version: 1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9
>> 4 Nodes configured, unknown expected votes
>> 8 Resources configured.
>> ============
>>
>> Online: [ san2.mydomain.com init2.mydomain.com init1.mydomain.com ]
>> OFFLINE: [ san1.mydomain.com ]
>>
>> Resource Group: G_Target
>>     R_IP_Target (ocf::heartbeat:IPaddr2): Started san2.mydomain.com
>>     R_tgtd (ocf::acs:tgtdra): Started san2.mydomain.com
>> Master/Slave Set: ms-drbd0
>>     Masters: [ san2.mydomain.com ]
>>     Stopped: [ drbd0:0 ]    <---------- correct
>> Master/Slave Set: ms-drbd1
>>     Masters: [ san2.mydomain.com ]
>>     Stopped: [ drbd1:1 ]    <---------- incorrect
>> Master/Slave Set: ms-drbd2
>>     Masters: [ san2.mydomain.com ]
>>     Stopped: [ drbd2:0 ]    <---------- correct
>> Clone Set: pingd
>>     Started: [ init1.mydomain.com init2.mydomain.com san2.mydomain.com ]
>>     Stopped: [ R_pingd:2 ]
>>
>> Failed actions:
>>     drbd1:1_promote_0 (node=san2.mydomain.com, call=43, rc=1, status=complete): unknown error
>
> Does drbd report any error in the logs (look for lrmd.*drbd)?
> This looks like a resource or a drbd RA issue.
>
> Thanks,
>
> Dejan
>
>> As you can see, drbd0 and drbd2 promote with no issues. But drbd1 is not
>> promoting properly. I have checked my constraints, and I have tweaked out
>> the start-delay settings, but nothing happens the way I would like. I have
>> two initiators for redundancy as well. But I want the initiator to stay up
>> if the network goes down on either target. This has been puzzling me for
>> some time now. Any help would be greatly appreciated.
>
>> Here is what I have for a crm cli config:
>>
>> node $id="cee46f54-d517-4e4d-b0b8-3076fbc5751b" san2.mydomain.com \
>>     attributes standby="off"
>> node $id="bde24914-1235-4dc4-8686-f05fd9e6a35e" san1.mydomain.com \
>>     attributes standby="off"
>> node $id="1d3814dc-7928-4beb-99f6-c7ade09056a5" init2.mydomain.com \
>>     attributes standby="off"
>> node $id="a058cd72-b27e-4593-ac7e-d79db0709c15" init1.mydomain.com \
>>     attributes standby="off"
>> primitive R_IP_Target ocf:heartbeat:IPaddr2 \
>>     params ip="192.168.*.*" \
>>     params nic="eth0" \
>>     params iflabel="1" \
>>     op monitor interval="30s"
>> primitive R_tgtd ocf:acs:tgtdra \
>>     op monitor interval="30s" \
>>     op start interval="0" timeout="30s" start-delay="2s"
>> primitive R_IP_Init ocf:heartbeat:IPaddr2 \
>>     params ip="192.168.*.*" \
>>     params nic="eth0" \
>>     params iflabel="1" \
>>     op monitor interval="30s"
>> primitive R_iscsi ocf:heartbeat:iscsi \
>>     params target="target1.mydomain.com:san.targets" \
>>     params portal="192.168.*.*" \
>>     op monitor interval="30s" \
>>     op start interval="0" timeout="30s" start-delay="5s" \
>>     meta is-managed="true"
>> primitive R_LVM ocf:heartbeat:LVM \
>>     params volgrpname="VolGroup01" \
>>     op monitor interval="30s" \
>>     op start interval="0" timeout="30s" start-delay="5s" \
>>     meta is-managed="true"
>> primitive R_Filesystem ocf:heartbeat:Filesystem \
>>     params device="/dev/VolGroup01/LogVol00" \
>>     params directory="/san_targets/www" \
>>     params fstype="ext3" \
>>     op monitor interval="30s" \
>>     op start interval="0" timeout="30s" start-delay="5s"
>> primitive R_NFS ocf:heartbeat:nfsserver \
>>     params nfs_init_script="/etc/init.d/nfs" \
>>     params nfs_notify_cmd="/sbin/rpc.statd" \
>>     params nfs_shared_infodir="/san_targets/www/nfsinfo" \
>>     op monitor interval="30s"
>> primitive drbd0 ocf:heartbeat:drbd \
>>     params drbd_resource="drbd0" \
>>     op monitor interval="29s" role="Master" timeout="30s" \
>>     op monitor interval="30s" role="Slave" timeout="30s" \
>>     op start interval="0" timeout="30s" start-delay="10s"
>> primitive drbd1 ocf:heartbeat:drbd \
>>     params drbd_resource="drbd1" \
>>     op monitor interval="29s" role="Master" timeout="30s" \
>>     op monitor interval="30s" role="Slave" timeout="30s" \
>>     op start interval="0" timeout="30s" start-delay="10s"
>> primitive drbd2 ocf:heartbeat:drbd \
>>     params drbd_resource="drbd2" \
>>     op monitor interval="29s" role="Master" timeout="30s" \
>>     op monitor interval="30s" role="Slave" timeout="30s" \
>>     op start interval="0" timeout="30s" start-delay="10s"
>> primitive R_pingd ocf:pacemaker:pingd
>> primitive R_Failover_Alert_Init ocf:heartbeat:MailTo2 \
>>     params sender="[email protected]" \
>>     params email="[email protected],[email protected]" \
>>     params subject="ACS Init"
>> primitive R_Failover_Alert_Target ocf:heartbeat:MailTo2 \
>>     params sender="[email protected]" \
>>     params email="[email protected],[email protected]" \
>>     params subject="ACS San"
>> group G_Target R_IP_Target R_tgtd \
>>     meta target-role="Started"
>> group G_Init R_IP_Init R_iscsi R_LVM R_Filesystem R_NFS \
>>     meta target-role="Stopped"
>> ms ms-drbd0 drbd0 \
>>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> ms ms-drbd1 drbd1 \
>>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> ms ms-drbd2 drbd2 \
>>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
>> clone pingd R_pingd \
>>     meta target-role="Started"
>> clone Failover_Alert_Init R_Failover_Alert_Init \
>>     meta clone-max="2" target-role="Stopped"
>> clone Failover_Alert_Target R_Failover_Alert_Target \
>>     meta clone-max="2" target-role="Stopped"
>> location pingd-node-1 pingd 500: init1.mydomain.com
>> location pingd-node-2 pingd 500: init2.mydomain.com
>> location pingd-node-3 pingd 500: san1.mydomain.com
>> location pingd-node-4 pingd 500: san2.mydomain.com
>> location ms-drbd0-pref-1 ms-drbd0 200: san1.mydomain.com
>> location ms-drbd0-pref-2 ms-drbd0 100: san2.mydomain.com
>> location ms-drbd1-pref-1 ms-drbd1 200: san1.mydomain.com
>> location ms-drbd1-pref-2 ms-drbd1 100: san2.mydomain.com
>> location ms-drbd2-pref-1 ms-drbd2 200: san1.mydomain.com
>> location ms-drbd2-pref-2 ms-drbd2 100: san2.mydomain.com
>> location G_Target-pref-1 G_Target 200: san1.mydomain.com
>> location G_Target-pref-2 G_Target 100: san2.mydomain.com
>> location G_Init-pref-1 G_Init 200: init1.mydomain.com
>> location G_Init-pref-2 G_Init 100: init2.mydomain.com
>> location Failover-Alert-node1 Failover_Alert_Init 200: init1.mydomain.com
>> location Failover-Alert-node2 Failover_Alert_Init 100: init2.mydomain.com
>> location Failover-Alert-node3 Failover_Alert_Target 200: san1.mydomain.com
>> location Failover-Alert-node4 Failover_Alert_Target 100: san2.mydomain.com
>> colocation G_Target-on-ms-drbd0 inf: G_Target ms-drbd0:Master
>> colocation G_Target-on-ms-drbd1 inf: G_Target ms-drbd1:Master
>> colocation G_Target-on-ms-drbd2 inf: G_Target ms-drbd2:Master
>> order ms-drbd0-before-ms-drbd1 inf: ms-drbd0:promote ms-drbd1:promote
>> order ms-drbd1-before-ms-drbd2 inf: ms-drbd1:promote ms-drbd2:promote
>> order ms-drbd2-before-G_Target inf: ms-drbd2:promote G_Target:start
>> order G_Target-before-G_Init inf: G_Target:start G_Init:start
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9" \
>>     stonith-enabled="false" \
>>     stonith-action="reboot" \
>>     stop-orphan-resources="true" \
>>     stop-orphan-actions="true" \
>>     symmetric-cluster="false" \
>>     last-lrm-refresh="1239899583" \
>>     default-resource-stickiness="INFINITY"
>>
>> Any ideas?
>> --
>> View this message in context:
>> http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23082432.html
>> Sent from the Linux-HA mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems

--
View this message in context: http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23085508.html
Sent from the Linux-HA mailing list archive at Nabble.com.
