All looks fine to me. Can you post your drbd.conf?
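While you dig that out, here is a rough way to pull the connection state out of /proc/drbd on each node so the two sides can be compared quickly. It is only a minimal sketch: it assumes the 8.2-style status line shown in your output quoted below, and the names `parse_drbd_status` and `needs_attention` are made up for illustration. A StandAlone or WFConnection state is a hint, not proof of split brain; the kernel log is what actually says so.

```python
import re

# Matches the per-device status line of an 8.2-style /proc/drbd, e.g.
#  0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown r---
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+st:(?P<st>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_drbd_status(text):
    """Return one dict per device found in /proc/drbd-style text."""
    devices = []
    for line in text.splitlines():
        m = STATUS_RE.match(line)
        if m:
            local_role, _, peer_role = m.group("st").partition("/")
            devices.append({
                "minor": int(m.group("minor")),
                "cs": m.group("cs"),
                "local_role": local_role,
                "peer_role": peer_role,
                "ds": m.group("ds"),
            })
    return devices


def needs_attention(device):
    """Heuristic only: flag devices that are not talking to their peer."""
    return device["cs"] in ("StandAlone", "WFConnection")


# Sample taken from the mail1 output quoted in this thread.
SAMPLE = """version: 8.2.6 (api:88/proto:86-88)
 0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0 oos:8192
"""

if __name__ == "__main__":
    for dev in parse_drbd_status(SAMPLE):
        print(dev["minor"], dev["cs"], needs_attention(dev))
```

On a live node you would feed it the contents of /proc/drbd instead of SAMPLE.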
> -----Original Message-----
> From: [email protected] [mailto:linux-ha-
> [email protected]] On Behalf Of David Hoskinson
> Sent: 29 June 2009 17:24
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Failover problems
>
> I have made a very simple drbd and filesystem startup and it still is
> resulting in a split brain. I have to be missing something here is my
> test config.
>
> <cib validate-with="pacemaker-1.0" crm_feature_set="3.0.1" have-quorum="1"
>   admin_epoch="0" epoch="189" num_updates="0"
>   cib-last-written="Mon Jun 29 11:04:50 2009" dc-uuid="mail1">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>           value="1.0.4-6dede86d6105786af3a5321ccf66b44b6914f0aa"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>           name="cluster-infrastructure" value="openais"/>
>         <nvpair id="cib-bootstrap-options-expected-quorum-votes"
>           name="expected-quorum-votes" value="2"/>
>         <nvpair id="cib-bootstrap-options-last-lrm-refresh"
>           name="last-lrm-refresh" value="1245863799"/>
>         <nvpair id="cib-bootstrap-options-no-quorum-policy"
>           name="no-quorum-policy" value="ignore"/>
>         <nvpair id="cib-bootstrap-options-stonith-enabled"
>           name="stonith-enabled" value="false"/>
>         <nvpair id="cib-bootstrap-options-default-resource-stickiness"
>           name="default-resource-stickiness" value="200"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="mail1" uname="mail1" type="normal"/>
>       <node id="mail2" uname="mail2" type="normal"/>
>     </nodes>
>     <resources>
>       <master id="ms-drbd0">
>         <meta_attributes id="ms-drbd0-meta_attributes">
>           <nvpair id="ms-drbd0-meta_attributes-clone-max" name="clone-max"
>             value="2"/>
>           <nvpair id="ms-drbd0-meta_attributes-notify" name="notify"
>             value="true"/>
>           <nvpair id="ms-drbd0-meta_attributes-globally-unique"
>             name="globally-unique" value="false"/>
>           <nvpair id="ms-drbd0-meta_attributes-target-role"
>             name="target-role"
>             value="Started"/>
>         </meta_attributes>
>         <primitive class="ocf" id="drbd0" provider="heartbeat"
>           type="drbd">
>           <instance_attributes id="drbd0-instance_attributes">
>             <nvpair id="drbd0-instance_attributes-drbd_resource"
>               name="drbd_resource" value="r0"/>
>           </instance_attributes>
>           <operations>
>             <op id="drbd0-monitor-59s" interval="59s" name="monitor"
>               role="Master" timeout="30s"/>
>             <op id="drbd0-monitor-60s" interval="60s" name="monitor"
>               role="Slave" timeout="30s"/>
>           </operations>
>         </primitive>
>       </master>
>       <primitive class="ocf" id="fs0" provider="heartbeat"
>         type="Filesystem">
>         <instance_attributes id="fs0-instance_attributes">
>           <nvpair id="fs0-instance_attributes-fstype" name="fstype"
>             value="ext3"/>
>           <nvpair id="fs0-instance_attributes-directory" name="directory"
>             value="/shared"/>
>           <nvpair id="fs0-instance_attributes-device" name="device"
>             value="/dev/drbd0"/>
>         </instance_attributes>
>         <meta_attributes id="fs0-meta_attributes">
>           <nvpair id="fs0-meta_attributes-target-role-stopped"
>             name="target-role-stopped"/>
>           <nvpair id="fs0-meta_attributes-target-role" name="target-role"
>             value="Started"/>
>         </meta_attributes>
>       </primitive>
>     </resources>
>     <constraints>
>       <rsc_order first="ms-drbd0" first-action="promote"
>         id="ms-drbd-before-fs0" score="INFINITY" then="fs0" then-action="start"/>
>       <rsc_colocation id="fs0-on-ms-drbd0" rsc="fs0" score="INFINITY"
>         with-rsc="ms-drbd0" with-rsc-role="Master"/>
>     </constraints>
>     <rsc_defaults/>
>     <op_defaults/>
>   </configuration>
> </cib>
>
> No preferred master.
>
> In my test...
>
> Crm_mon
>
> ============
> Last updated: Mon Jun 29 11:14:27 2009
> Stack: openais
> Current DC: mail2 - partition with quorum
> Version: 1.0.4-6dede86d6105786af3a5321ccf66b44b6914f0aa
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ mail1 mail2 ]
>
> Master/Slave Set: ms-drbd0
>     Masters: [ mail1 ]
>     Slaves: [ mail2 ]
> fs0 (ocf::heartbeat:Filesystem): Started mail1
>
> Mail1 was picked as master, it recognizes mail2 and has loaded the
> filesystem on mail1.
>
>
> [r...@mail1 crm]# cat /proc/drbd
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
> build...@c5-i386-build, 2008-10-03 11:42:32
>  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>     ns:8 nr:0 dw:4 dr:201 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 oos:0
>
>
> Drbd recognizes mail1 as the primary, sees the secondary and sync is
> uptodate.
>
> When I shut down primary (mail1) in this case I see this as I should using
> crm_mon on mail2:
>
> ============
> Last updated: Mon Jun 29 11:18:28 2009
> Stack: openais
> Current DC: mail2.eng.uiowa.edu - partition WITHOUT quorum
> Version: 1.0.4-6dede86d6105786af3a5321ccf66b44b6914f0aa
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ mail2 ]
> OFFLINE: [ mail1 ]
>
> Master/Slave Set: ms-drbd0
>     Masters: [ mail2 ]
>     Stopped: [ drbd0:0 ]
> fs0 (ocf::heartbeat:Filesystem): Started mail2
>
> And then as mail1 becomes available.....
>
> ============
> Last updated: Mon Jun 29 11:19:44 2009
> Stack: openais
> Current DC: mail2 - partition with quorum
> Version: 1.0.4-6dede86d6105786af3a5321ccf66b44b6914f0aa
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ mail1 mail2 ]
>
> Master/Slave Set: ms-drbd0
>     Masters: [ mail2 ]
>     Slaves: [ mail1 ]
> fs0 (ocf::heartbeat:Filesystem): Started mail2
>
> So far so good, this is what I would expect it to say.
> However, if I look at drbd again:
>
> [r...@mail2 ~]# cat /proc/drbd
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
> build...@c5-i386-build, 2008-10-03 11:42:32
>  0: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown C r---
>     ns:0 nr:8 dw:12 dr:197 al:1 bm:1 lo:0 pe:0 ua:0 ap:0 oos:4
>
> [r...@mail1 ~]# cat /proc/drbd
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
> build...@c5-i386-build, 2008-10-03 11:42:32
>  0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown r---
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0 oos:8192
>
>
> It's split again.
>
> It has to be something simple that I am missing...
>
>
> On 6/29/09 10:37 AM, "[email protected]"
> <[email protected]> wrote:
>
> > I may have missed this but are you using the old style drbddisk RA or
> > the new drbd RA?
> >
> > If it's the new one, have you ensured the init script for DRBD is turned off?
> >
> > Also do you have an ordering constraint so you aren't trying to mount
> > the device before it is brought online?
> >
> > Some info below I've put together from the clusterlabs web site for my
> > own config.
> >
> >
> >
> > Open the crm and start configuring it
> >
> > crm
> > configure
> >
> > primitive drbd0 ocf:heartbeat:drbd \
> >     params drbd_resource=hub_disk \
> >     op monitor role=Master interval=59s timeout=30s \
> >     op monitor role=Slave interval=60s timeout=30s
> >
> > This means:
> >
> > * primitive - It's a primitive resource.
> > * drbd0 - This is the name we are giving it. It's always the second
> >   parameter. We could call this anything (within reason)
> > * ocf:heartbeat:drbd - ocf means the resource agent is an OCF type,
> >   (Open Cluster Framework), provided by heartbeat and it's the drbd RA.
> > * params - Give each parameter you require here. Press tab for a
> >   list. drbd_resource is the name you have in the DRBD config.
> > * op - Put an operation on the resource...
> > * monitor - Which is a monitor. You are saying monitor with this
> >   interval and this timeout when the resource instance is a master, then
> >   you have another monitor with different values for if it's a slave.
> >
> > ms ms-drbd0 drbd0 \
> >     meta clone-max=2 notify=true globally-unique=false
> >
> > This means:
> >
> > * ms - It's a multi-state (master/slave) resource
> > * ms-drbd0 - We call it this as it's a master-slave of the drbd0
> >   resource we configured above
> > * drbd0 - The primitive this multi-state resource wraps
> > * meta - Specific meta information goes after this. Maximum number
> >   of clones is 2, notify the RA on a change of role, it's not globally
> >   unique as it's on 2 servers.
> >
> > primitive fs0 ocf:heartbeat:Filesystem \
> >     params fstype=ext3 directory=/www device=/dev/drbd0 \
> >     meta migration-threshold="50"
> >
> > This means:
> >
> > * primitive fs0 - It's another primitive resource, we're calling
> >   this fs0 for filesystem0.
> > * ocf:heartbeat:Filesystem - The resource agent is type OCF,
> >   provided by heartbeat and is the Filesystem RA. It takes care of
> >   mounting and unmounting a filesystem on a device.
> > * params - These are the parameters we pass to the RA. In this case
> >   it's just the 3 things that mount needs to know, the FS type, where to
> >   mount it and the device name. As we're using drbd it's /dev/drbd0
> >
> > primitive proftpd lsb:proftpd \
> >     op monitor interval="20s" timeout="10s" \
> >     meta migration-threshold="50"
> >
> > This means:
> >
> > * It's another primitive resource called proftpd.
> > * lsb:proftpd - This is an LSB resource agent (/etc/init.d script)
> > * There are no parameters to pass to this init script. You can build
> >   them in but don't have to.
> > * You are putting a monitor operation on it that checks it every 20s
> >   and times out after 10s. The monitor operation just runs
> >   /etc/init.d/proftpd status. If it gets a return code of 0 it's working.
> >   A return code of 3 means it's not.
> >   The init scripts have to be LSB compliant (give the correct return
> >   codes) to work.
> > * Finally the migration threshold is how many failures it can have
> >   before it will failover to the other node.
> >
> > primitive tomcat lsb:tomcat \
> >     op monitor interval="30s" timeout="20s" \
> >     meta migration-threshold="50"
> >
> > Should be self-explanatory by now. It's a primitive resource called
> > tomcat using an LSB init script called tomcat. Pacemaker will call the
> > init script's status function every 30s and wait 20s for a response. If
> > it fails 50 times it will be migrated over to the other node.
> >
> > primitive virtual-ip ocf:heartbeat:IPaddr2 \
> >     params ip="2.21.4.45" broadcast="2.255.255.255" nic="eth0" cidr_netmask="8" \
> >     op monitor interval=21s timeout=5s
> >
> > And again, an IPaddr2 OCF RA called virtual-ip. Give it the parameters
> > it needs and monitor it every 21s, timeout 5s.
> >
> > group resource-group fs0 proftpd tomcat virtual-ip
> >
> > Now we group all our primitive resources together into resource group
> > called.... resource-group (imaginative eh?)
> >
> > order ms-drbd0-before-fs0 inf: ms-drbd0:promote fs0:start
> >
> > This sets an order constraint called ms-drbd0-before-fs0. The inf: means
> > INFINITY scoring (mandatory). The ms-drbd0:promote says to first promote
> > that resource then the fs0:start means to then start that resource. For
> > info the XML of that command comes out as:
> >
> > <rsc_order first="ms-drbd0" first-action="promote"
> >   id="ms-drbd0-before-fs0" score="INFINITY" then="fs0"
> >   then-action="start"/>
> >
> > colocation res-group-on-ms-drbd0 inf: resource-group ms-drbd0:Master
> >
> > This is a colocation constraint. It's to ensure certain resources have
> > to run together on the same node. This one is called
> > res-group-on-ms-drbd0 score INFINITY and resource-group has to be
> > colocated with ms-drbd0 as the Master.
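An inline note on the two constraints just described: they can be checked mechanically against the XML of a CIB (for example one saved from `cibadmin -Q`). The following is only a minimal sketch using Python's standard library. The function name `check_constraints` is hypothetical, and the XML fragment is the constraints section from the CIB posted in this thread, so adjust the ids for any other setup.

```python
import xml.etree.ElementTree as ET

# Constraints section copied from the CIB posted in this thread.
CONSTRAINTS_XML = """
<constraints>
  <rsc_order first="ms-drbd0" first-action="promote"
    id="ms-drbd-before-fs0" score="INFINITY" then="fs0" then-action="start"/>
  <rsc_colocation id="fs0-on-ms-drbd0" rsc="fs0" score="INFINITY"
    with-rsc="ms-drbd0" with-rsc-role="Master"/>
</constraints>
"""


def check_constraints(xml_text, ms_id, rsc_id):
    """Return (has_order, has_colocation) for the given ms/resource pair.

    has_order: a rsc_order that promotes ms_id before starting rsc_id.
    has_colocation: an INFINITY rsc_colocation of rsc_id with the Master
    role of ms_id.
    """
    root = ET.fromstring(xml_text)
    has_order = any(
        o.get("first") == ms_id
        and o.get("first-action") == "promote"
        and o.get("then") == rsc_id
        and o.get("then-action") == "start"
        for o in root.iter("rsc_order")
    )
    has_colocation = any(
        c.get("rsc") == rsc_id
        and c.get("with-rsc") == ms_id
        and c.get("with-rsc-role") == "Master"
        and c.get("score") == "INFINITY"
        for c in root.iter("rsc_colocation")
    )
    return has_order, has_colocation


if __name__ == "__main__":
    print(check_constraints(CONSTRAINTS_XML, "ms-drbd0", "fs0"))
```

Both values come back true for the posted config, which matches my reading that the constraints themselves are not the cause of the split brain.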
> >
> > location ms-drbd0-master-on-hub1 ms-drbd0 \
> >     rule id="ms-drbd0-master-on-hub1-rule" role="master" 100: #uname eq hub1
> >
> > Finally this is to make the migration-threshold work. The location is
> > called ms-drbd0-master-on-hub1 using ms-drbd0 resource as something for
> > the rule to stick to. The role is master for ms-drbd0 score 100 and the
> > uname of the node has to be hub1.
> >
> > commit
> > end
> > quit
> >
> > So working backwards:
> >
> > 1. With a score of 100, the DRBD master role prefers hub1 (a
> >    preference, not a requirement)
> > 2. The resource group resource-group has to be on the same node as
> >    the DRBD resource. This score is INFINITY which makes it mandatory.
> > 3. The resource fs0 has to start after the DRBD resource has been
> >    promoted, as we can't mount any dirs using the Filesystem resource until
> >    it's a primary.
> > 4. The fs0, tomcat and proftpd resources all have a migration
> >    threshold of 50. If any one of them goes over this it will cause some
> >    scores to be evaluated and then action will be decided by the crm. If
> >    the 2nd node has no issues barring the failover of resources onto it
> >    then that resource will be failed over. As we have colocation
> >    constraints then those will be taken into account with the evaluation.
> >
> > Finally chkconfig off drbd, tomcat and proftpd to be sure they won't
> > start at boot time (pacemaker will start them).
> >
> >> -----Original Message-----
> >> From: [email protected] [mailto:linux-ha-
> >> [email protected]] On Behalf Of David Hoskinson
> >> Sent: 29 June 2009 16:12
> >> To: General Linux-HA mailing list
> >> Subject: [Linux-HA] Failover problems
> >>
> >> I must be missing something here I hope someone can help. I have a
> >> master/slave setup using latest openais/pacemaker/drbd. System starts up
> >> perfectly and if I shutdown slave, primary notices status change and also
> >> notices when slave reconnects.
> >> If I shutdown master, drbd and services
> >> transfer to slave and all works well.
> >>
> >> The problem as I see it, is that when the master comes back on line it
> >> reassumes the drbd and services however I am left with a split brain for
> >> the drbd. I get split brain messages in logs, and primary machine shows
> >> primary/unknown in the cat /proc/drbd. And Slave shows slave/unknown. I am
> >> able to manually reconnect the drives as has been suggested earlier but this
> >> doesn't seem to be the "normal" way in my way of thinking or am I wrong
> >> with this. Should it be split brain when master takes back over? I want to
> >> know if I am struggling over something I shouldn't be. It just seems to me
> >> that it should seamlessly reconnect without enabling the "automatic" split
> >> brain function in drbd.
> >>
> >> Hope this makes sense to someone...
> >>
> >>
> >> _______________________________________________
> >> Linux-HA mailing list
> >> [email protected]
> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >> See also: http://linux-ha.org/ReportingProblems
