Hi,

On Thu, Apr 16, 2009 at 01:15:55PM -0700, Ethan Bannister wrote:
>
> /var/log/messages on san2 states that it couldn't promote drbd1:1 on san2
> because san1 was still in primary mode. This makes sense. But why would it
> have no issues with taking down the other drbd devices on san1 and not
> drbd1? Is there a log file that may give me a better idea of what may be
> going on? I am assuming that when I pull the cable or take down eth0, the
> rest of the cluster is unable to tell san1 to demote the drbd devices so
> that san2 can then promote them. But from what I gather from this log file,
> drbdadm does all of this. So would it be safe to assume that drbdadm
> communicates through the direct link between the two targets and it is
> failing for drbd1 for some reason?

AFAIK, drbd is using just one link. If that cable is pulled, then
you have a drbd split brain. BTW, you may want to take a look at
dopd to have heartbeat help drbd in this case.

Thanks,

Dejan

> This is puzzling me. I know that I am
> missing something that is right under my nose :confused:
>
> Apr 16 14:04:39 san2 lrmd: [12984]: info: rsc:drbd1:1: promote
> Apr 16 14:04:40 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd0:1_monitor_29000 (call=107, rc=8, cib-update=133, confirmed=false) complete master
> Apr 16 14:04:41 san2 lrmd: [12984]: info: RA output: (drbd1:1:promote:stdout) /dev/drbd1: State change failed: (-1) Multiple primaries not allowed by config Command 'drbdsetup /dev/drbd1 primary' terminated with exit code 11
> Apr 16 14:04:41 san2 drbd[6372]: [6459]: ERROR: drbd1 promote: Not primary despite drbdadm call.
> Apr 16 14:04:41 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_promote_0 (call=108, rc=1, cib-update=134, confirmed=true) complete unknown error
> Apr 16 14:04:41 san2 kernel: drbd1: peer( Primary -> Secondary )
> Apr 16 14:04:42 san2 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
> Apr 16 14:04:42 san2 kernel: drbd1: Writing meta data super block now.
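
For reference, a minimal sketch of what the dopd setup suggested above might look like, assuming DRBD 8.x running under heartbeat. The resource name drbd1 is taken from this thread; the binary paths, the hacluster/haclient names and the "-t 5" timeout are assumptions that differ between distributions (for example /usr/lib64/heartbeat on 64-bit systems). Some later DRBD 8.3 releases call the handler fence-peer instead of outdate-peer, so check the drbd.conf man page for the installed version. dopd can only outdate the peer if heartbeat still has a communication path to it that does not depend on the failed link.

```
# /etc/ha.d/ha.cf (heartbeat): start dopd and let it talk to the heartbeat API
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

# /etc/drbd.conf: fencing policy and peer-outdater handler for one resource
resource drbd1 {
  disk {
    fencing resource-only;
  }
  handlers {
    # path and timeout are assumptions; adjust for your distribution
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }
  # ... existing settings for drbd1 stay unchanged ...
}
```

The same disk and handlers sections would go into drbd0 and drbd2, or into a common section if all three resources should share the policy.
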
> Apr 16 14:04:42 san2 kernel: drbd1: asender terminated
> Apr 16 14:04:42 san2 kernel: drbd1: Terminating asender thread
> Apr 16 14:04:42 san2 kernel: drbd1: tl_clear()
> Apr 16 14:04:42 san2 kernel: drbd1: Connection closed
> Apr 16 14:04:42 san2 kernel: drbd1: conn( TearDown -> Unconnected )
> Apr 16 14:04:42 san2 kernel: drbd1: receiver terminated
> Apr 16 14:04:42 san2 kernel: drbd1: Restarting receiver thread
> Apr 16 14:04:42 san2 kernel: drbd1: receiver (re)started
> Apr 16 14:04:42 san2 kernel: drbd1: conn( Unconnected -> WFConnection )
> Apr 16 14:04:42 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=174:43:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
> Apr 16 14:04:42 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
> Apr 16 14:04:42 san2 crm_master: [6492]: info: Invoked: /usr/sbin/crm_master -l reboot -v 10
> Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd1:1
> Apr 16 14:04:42 san2 attrd: [12986]: info: attrd_perform_update: Sent update 118: master-drbd1:1=10
> Apr 16 14:04:42 san2 lrmd: [12984]: info: RA output: (drbd1:1:notify:stdout) 0 Trying master-drbd1:1=10 update via attrd
> Apr 16 14:04:42 san2 crmd: [12987]: info: process_lrm_event: LRM operation drbd1:1_notify_0 (call=109, rc=0, cib-update=135, confirmed=true) complete ok
> Apr 16 14:04:43 san2 crmd: [12987]: info: do_lrm_rsc_op: Performing key=170:44:0:90b5d1cc-a955-48e8-a1a6-7a2674a8c783 op=drbd1:1_notify_0 )
> Apr 16 14:04:43 san2 lrmd: [12984]: info: rsc:drbd1:1: notify
>
>
>
> Dejan Muhamedagic wrote:
> >
> > Hi,
> >
> > On Thu, Apr 16, 2009 at 10:11:26AM -0700, Ethan Bannister wrote:
> >>
> >> Perhaps someone may be able to give me a little insight on what I may be
> >> doing wrong. I would like to have DRBD promote on secondary machine when
> >> the Ethernet connection to the initiator on my SAN goes down.
> >> When I pull the cable or bring eth0 down which IPaddr resides on, this
> >> is what crm_mon shows me soon after:
> >>
> >> ============
> >> Last updated: Thu Apr 16 12:38:36 2009
> >> Current DC: init2.mydomain.com (1d3814dc-7928-4beb-99f6-c7ade09056a5) - partition with quorum
> >> Version: 1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9
> >> 4 Nodes configured, unknown expected votes
> >> 8 Resources configured.
> >> ============
> >>
> >> Online: [ san2.mydomain.com init2.mydomain.com init1.mydomain.com ]
> >> OFFLINE: [ san1.mydomain.com ]
> >>
> >> Resource Group: G_Target
> >>     R_IP_Target (ocf::heartbeat:IPaddr2): Started san2.mydomain.com
> >>     R_tgtd (ocf::acs:tgtdra): Started san2.mydomain.com
> >> Master/Slave Set: ms-drbd0
> >>     Masters: [ san2.mydomain.com ]
> >>     Stopped: [ drbd0:0 ] <---------- correct
> >> Master/Slave Set: ms-drbd1
> >>     Masters: [ san2.mydomain.com ]
> >>     Stopped: [ drbd1:1 ] <---------- incorrect
> >> Master/Slave Set: ms-drbd2
> >>     Masters: [ san2.mydomain.com ]
> >>     Stopped: [ drbd2:0 ] <---------- correct
> >> Clone Set: pingd
> >>     Started: [ init1.mydomain.com init2.mydomain.com san2.mydomain.com ]
> >>     Stopped: [ R_pingd:2 ]
> >>
> >> Failed actions:
> >>     drbd1:1_promote_0 (node=san2.mydomain.com, call=43, rc=1, status=complete): unknown error
> >
> > Does drbd report any error in the logs (look for lrmd.*drbd)?
> > This looks like a resource or a drbd RA issue.
> >
> > Thanks,
> >
> > Dejan
> >
> >> As you can see, drbd0 and drbd2 promote with no issues. But drbd1 is not
> >> promoting properly. I have checked my constraints, and I have tweaked out
> >> the start-delay settings, but nothing happens the way I would like. I have
> >> two initiators for redundancy as well. But I want the initiator to stay up
> >> if the network goes down on either target. This has been puzzling me for
> >> some time now. Any help would be greatly appreciated.
> >
> >> Here is what I have for a crm cli config:
> >>
> >> node $id="cee46f54-d517-4e4d-b0b8-3076fbc5751b" san2.mydomain.com \
> >>     attributes standby="off"
> >> node $id="bde24914-1235-4dc4-8686-f05fd9e6a35e" san1.mydomain.com \
> >>     attributes standby="off"
> >> node $id="1d3814dc-7928-4beb-99f6-c7ade09056a5" init2.mydomain.com \
> >>     attributes standby="off"
> >> node $id="a058cd72-b27e-4593-ac7e-d79db0709c15" init1.mydomain.com \
> >>     attributes standby="off"
> >> primitive R_IP_Target ocf:heartbeat:IPaddr2 \
> >>     params ip="192.168.*.*" \
> >>     params nic="eth0" \
> >>     params iflabel="1" \
> >>     op monitor interval="30s"
> >> primitive R_tgtd ocf:acs:tgtdra \
> >>     op monitor interval="30s" \
> >>     op start interval="0" timeout="30s" start-delay="2s"
> >> primitive R_IP_Init ocf:heartbeat:IPaddr2 \
> >>     params ip="192.168.*.*" \
> >>     params nic="eth0" \
> >>     params iflabel="1" \
> >>     op monitor interval="30s"
> >> primitive R_iscsi ocf:heartbeat:iscsi \
> >>     params target="target1.mydomain.com:san.targets" \
> >>     params portal="192.168.*.*" \
> >>     op monitor interval="30s" \
> >>     op start interval="0" timeout="30s" start-delay="5s" \
> >>     meta is-managed="true"
> >> primitive R_LVM ocf:heartbeat:LVM \
> >>     params volgrpname="VolGroup01" \
> >>     op monitor interval="30s" \
> >>     op start interval="0" timeout="30s" start-delay="5s" \
> >>     meta is-managed="true"
> >> primitive R_Filesystem ocf:heartbeat:Filesystem \
> >>     params device="/dev/VolGroup01/LogVol00" \
> >>     params directory="/san_targets/www" \
> >>     params fstype="ext3" \
> >>     op monitor interval="30s" \
> >>     op start interval="0" timeout="30s" start-delay="5s"
> >> primitive R_NFS ocf:heartbeat:nfsserver \
> >>     params nfs_init_script="/etc/init.d/nfs" \
> >>     params nfs_notify_cmd="/sbin/rpc.statd" \
> >>     params nfs_shared_infodir="/san_targets/www/nfsinfo" \
> >>     op monitor interval="30s"
> >> primitive drbd0 ocf:heartbeat:drbd \
> >>     params drbd_resource="drbd0" \
> >>     op monitor interval="29s" role="Master" timeout="30s" \
> >>     op monitor interval="30s" role="Slave" timeout="30s" \
> >>     op start interval="0" timeout="30s" start-delay="10s"
> >> primitive drbd1 ocf:heartbeat:drbd \
> >>     params drbd_resource="drbd1" \
> >>     op monitor interval="29s" role="Master" timeout="30s" \
> >>     op monitor interval="30s" role="Slave" timeout="30s" \
> >>     op start interval="0" timeout="30s" start-delay="10s"
> >> primitive drbd2 ocf:heartbeat:drbd \
> >>     params drbd_resource="drbd2" \
> >>     op monitor interval="29s" role="Master" timeout="30s" \
> >>     op monitor interval="30s" role="Slave" timeout="30s" \
> >>     op start interval="0" timeout="30s" start-delay="10s"
> >> primitive R_pingd ocf:pacemaker:pingd
> >> primitive R_Failover_Alert_Init ocf:heartbeat:MailTo2 \
> >>     params sender="[email protected]" \
> >>     params email="[email protected],[email protected]" \
> >>     params subject="ACS Init"
> >> primitive R_Failover_Alert_Target ocf:heartbeat:MailTo2 \
> >>     params sender="[email protected]" \
> >>     params email="[email protected],[email protected]" \
> >>     params subject="ACS San"
> >> group G_Target R_IP_Target R_tgtd \
> >>     meta target-role="Started"
> >> group G_Init R_IP_Init R_iscsi R_LVM R_Filesystem R_NFS \
> >>     meta target-role="Stopped"
> >> ms ms-drbd0 drbd0 \
> >>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> >> ms ms-drbd1 drbd1 \
> >>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> >> ms ms-drbd2 drbd2 \
> >>     meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
> >> clone pingd R_pingd \
> >>     meta target-role="Started"
> >> clone Failover_Alert_Init R_Failover_Alert_Init \
> >>     meta clone-max="2" target-role="Stopped"
> >> clone Failover_Alert_Target R_Failover_Alert_Target \
> >>     meta clone-max="2" target-role="Stopped"
> >> location pingd-node-1 pingd 500: init1.mydomain.com
> >> location pingd-node-2 pingd 500: init2.mydomain.com
> >> location pingd-node-3 pingd 500: san1.mydomain.com
> >> location pingd-node-4 pingd 500: san2.mydomain.com
> >> location ms-drbd0-pref-1 ms-drbd0 200: san1.mydomain.com
> >> location ms-drbd0-pref-2 ms-drbd0 100: san2.mydomain.com
> >> location ms-drbd1-pref-1 ms-drbd1 200: san1.mydomain.com
> >> location ms-drbd1-pref-2 ms-drbd1 100: san2.mydomain.com
> >> location ms-drbd2-pref-1 ms-drbd2 200: san1.mydomain.com
> >> location ms-drbd2-pref-2 ms-drbd2 100: san2.mydomain.com
> >> location G_Target-pref-1 G_Target 200: san1.mydomain.com
> >> location G_Target-pref-2 G_Target 100: san2.mydomain.com
> >> location G_Init-pref-1 G_Init 200: init1.mydomain.com
> >> location G_Init-pref-2 G_Init 100: init2.mydomain.com
> >> location Failover-Alert-node1 Failover_Alert_Init 200: init1.mydomain.com
> >> location Failover-Alert-node2 Failover_Alert_Init 100: init2.mydomain.com
> >> location Failover-Alert-node3 Failover_Alert_Target 200: san1.mydomain.com
> >> location Failover-Alert-node4 Failover_Alert_Target 100: san2.mydomain.com
> >> colocation G_Target-on-ms-drbd0 inf: G_Target ms-drbd0:Master
> >> colocation G_Target-on-ms-drbd1 inf: G_Target ms-drbd1:Master
> >> colocation G_Target-on-ms-drbd2 inf: G_Target ms-drbd2:Master
> >> order ms-drbd0-before-ms-drbd1 inf: ms-drbd0:promote ms-drbd1:promote
> >> order ms-drbd1-before-ms-drbd2 inf: ms-drbd1:promote ms-drbd2:promote
> >> order ms-drbd2-before-G_Target inf: ms-drbd2:promote G_Target:start
> >> order G_Target-before-G_Init inf: G_Target:start G_Init:start
> >> property $id="cib-bootstrap-options" \
> >>     dc-version="1.0.3-b133b3f19797c00f9189f4b66b513963f9d25db9" \
> >>     stonith-enabled="false" \
> >>     stonith-action="reboot" \
> >>     stop-orphan-resources="true" \
> >>     stop-orphan-actions="true" \
> >>     symmetric-cluster="false" \
> >>     last-lrm-refresh="1239899583" \
> >>     default-resource-stickiness="INFINITY"
> >>
> >> Any ideas?
> >> --
> >> View this message in context:
> >> http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23082432.html
> >> Sent from the Linux-HA mailing list archive at Nabble.com.
> >>
> >> _______________________________________________
> >> Linux-HA mailing list
> >> [email protected]
> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >> See also: http://linux-ha.org/ReportingProblems
>
> --
> View this message in context:
> http://www.nabble.com/DRBD-does-not-switch-resources-to-other-node-properly-tp23082432p23085508.html
> Sent from the Linux-HA mailing list archive at Nabble.com.
