Re: [Linux-HA] Failover problems

Darren.Mansell Mon, 29 Jun 2009 08:37:40 -0700

I may have missed this, but are you using the old-style drbddisk RA or
the new drbd RA?


If it's the new one, have you made sure the DRBD init script is turned
off?

Also do you have an ordering constraint so you aren't trying to mount
the device before it is brought online?

Some info below I've put together from the clusterlabs web site for my
own config.



Open the crm and start configuring it

crm
configure

primitive drbd0 ocf:heartbeat:drbd \
params drbd_resource=hub_disk \
op monitor role=Master interval=59s timeout=30s \
op monitor role=Slave interval=60s timeout=30s

This means:

    * primitive - It's a primitive resource.
    * drbd0 - This is the name we are giving it. It's always the second
parameter. We could call this anything (within reason)
    * ocf:heartbeat:drbd - ocf means the resource agent is an OCF type,
(Open Cluster Framework), provided by heartbeat and it's the drbd RA.
    * params - Give each parameter you require here. Press tab for a
list. drbd_resource is the name you have in the DRBD config.
    * op - Put an operation on the resource...
    * monitor - ...which is a monitor. The first monitor uses this
interval and timeout while the resource instance is a master; the second
uses different values while it's a slave. The intervals differ (59s and
60s) because two monitors on the same resource can't share an interval.

ms ms-drbd0 drbd0 \
meta clone-max=2 notify=true globally-unique=false

This means:

    * ms - It's a multi-state (master/slave) resource, not a constraint
    * ms-drbd0 - We call it this as it's a master-slave of the drbd0
resource we configured above
    * drbd0 - The primitive this multi-state resource wraps
    * meta - Specific meta information goes after this. Maximum number
of clones is 2, notify the RA on a change of role, it's not globally
unique as it's on 2 servers.

primitive fs0 ocf:heartbeat:Filesystem \
params fstype=ext3 directory=/www device=/dev/drbd0 \
meta migration-threshold="50"

This means:

    * primitive fs0 - It's another primitive resource, we're calling
this fs0 for filesystem0.
    * ocf:heartbeat:Filesystem - The resource agent is type OCF,
provided by heartbeat and is the Filesystem RA. It takes care of
mounting and unmounting a filesystem on a device.
    * params - These are the parameters we pass to the RA. In this case
it's just the 3 things that mount needs to know, the FS type, where to
mount it and the device name. As we're using drbd it's /dev/drbd0 

primitive proftpd lsb:proftpd \
op monitor interval="20s" timeout="10s" \
meta migration-threshold="50"

This means:

    * It's another primitive resource called proftpd.
    * lsb:proftpd - This is an LSB resource agent (/etc/init.d script)
    * There are no parameters to pass to this init script. You can build
them in but don't have to.
    * You are putting a monitor operation on it that checks it every 20s
and times out after 10s. The monitor operation just runs
/etc/init.d/proftpd status. If it gets a return code of 0 it's working.
A return code of 3 means it's not. The init scripts have to be LSB
compliant (give the correct return codes) to work.
    * Finally the migration threshold is how many failures it can have
before it will failover to the other node. 
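
To check by hand whether an init script gives usable return codes, a
small helper like the one below can translate the exit code of the
status action. This is only a sketch: lsb_status_meaning is a name I
made up, and the meanings are the ones the LSB spec defines for codes
0 to 4.

```shell
# Map an LSB init-script "status" exit code to its meaning.
# lsb_status_meaning is a hypothetical helper, not part of Pacemaker.
lsb_status_meaning() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, but pid file exists" ;;
        2) echo "dead, but lock file exists" ;;
        3) echo "not running" ;;
        4) echo "status unknown" ;;
        *) echo "reserved or non-standard code $1" ;;
    esac
}

# On a cluster node you would feed it the real exit code, e.g.:
#   /etc/init.d/proftpd status; lsb_status_meaning $?
lsb_status_meaning 0
lsb_status_meaning 3
```

Run it once with the service started and once with it stopped. If a
stopped service still reports 0, the script isn't LSB compliant and
the monitor operation can't detect the failure.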

primitive tomcat lsb:tomcat \
op monitor interval="30s" timeout="20s" \
meta migration-threshold="50"

Should be self-explanatory by now. It's a primitive resource called
tomcat using an LSB init script called tomcat. Pacemaker will call the
init script's status function every 30s and wait 20s for a response. If
it fails 50 times it will be migrated over to the other node.

primitive virtual-ip ocf:heartbeat:IPaddr2 \
params ip="2.21.4.45" broadcast="2.255.255.255" nic="eth0" \
cidr_netmask="8" \
op monitor interval=21s timeout=5s

And again, an IPaddr2 OCF RA called virtual-ip. Give it the parameters
it needs and monitor it every 21s, timeout 5s.
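
If you change the ip or cidr_netmask for your own network, the
broadcast value has to change with them. A rough way to double-check
it is the sketch below; broadcast_for is a hypothetical helper name,
and it assumes a plain IPv4 address and a shell with 64-bit arithmetic
(bash, dash).

```shell
# Compute the IPv4 broadcast address for an address and prefix length.
# broadcast_for is a hypothetical helper, not part of the IPaddr2 RA.
broadcast_for() {
    ip=$1
    prefix=$2
    # Split the dotted quad into $1..$4.
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    # Host bits are everything to the right of the prefix.
    hostmask=$(( (1 << (32 - prefix)) - 1 ))
    bcast=$(( addr | hostmask ))
    echo "$(( (bcast >> 24) & 255 )).$(( (bcast >> 16) & 255 )).$(( (bcast >> 8) & 255 )).$(( bcast & 255 ))"
}

# Values from the virtual-ip primitive above.
broadcast_for 2.21.4.45 8    # prints 2.255.255.255
```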

group resource-group fs0 proftpd tomcat virtual-ip

Now we group all our primitive resources together into a resource group
called.... resource-group (imaginative eh?)

order ms-drbd0-before-fs0 inf: ms-drbd0:promote fs0:start

This sets an order constraint called ms-drbd0-before-fs0. The inf: means
INFINITY scoring (mandatory). The ms-drbd0:promote says to first promote
that resource then the fs0:start means to then start that resource. For
info the XML of that command comes out as:

<rsc_order first="ms-drbd0" first-action="promote"
id="ms-drbd0-before-fs0" score="INFINITY" then="fs0"
then-action="start"/>

colocation res-group-on-ms-drbd0 inf: resource-group ms-drbd0:Master

This is a colocation constraint. It's to ensure certain resources have
to run together on the same node. This one is called
res-group-on-ms-drbd0 score INFINITY and resource-group has to be
colocated with ms-drbd0 as the Master.

location ms-drbd0-master-on-hub1 ms-drbd0 \
rule id="ms-drbd0-master-on-hub1-rule" role="master" 100: #uname eq hub1

Finally this is to make the migration-threshold work. The location
constraint is called ms-drbd0-master-on-hub1 and is attached to the
ms-drbd0 resource. The rule applies to the master role of ms-drbd0 and
gives it a score of 100 on the node whose uname is hub1.

commit
end
quit

So working backwards:

   1. With a score of 100, the DRBD master prefers to be on hub1. It's a
preference rather than mandatory, as the score isn't INFINITY.
   2. The resource group resource-group has to be on the same node as
the DRBD resource. This score is INFINITY which makes it mandatory.
   3. The resource fs0 has to start after the DRBD resource has been
promoted, as we can't mount any dirs using the Filesystem resource until
it's a primary.
   4. The fs0, tomcat and proftpd resources all have a migration
threshold of 50. Once any one of them reaches that many failures on a
node, the crm re-evaluates the scores and decides what to do. If
nothing prevents the resources from running on the 2nd node, then that
resource will be failed over. As we have colocation constraints, those
will be taken into account in the evaluation.

Finally run chkconfig <service> off for drbd, tomcat and proftpd to be
sure they won't start at boot time (pacemaker will start them).
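
As a sketch, assuming a distro that uses chkconfig and that the init
scripts are named exactly drbd, tomcat and proftpd, the loop below
prints the commands by default and only runs them when APPLY=1 is set.
APPLY and disable_boot_start are names I made up for this example.

```shell
# Services that pacemaker manages in the config above.
services="drbd tomcat proftpd"

# Dry run by default; set APPLY=1 in the environment to really run it.
disable_boot_start() {
    for svc in $services; do
        if [ "${APPLY:-0}" = "1" ]; then
            chkconfig "$svc" off
        else
            echo "chkconfig $svc off"
        fi
    done
}

disable_boot_start
```

Afterwards chkconfig --list drbd should show the service off in every
runlevel.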

> -----Original Message-----
> From: [email protected] [mailto:linux-ha-
> [email protected]] On Behalf Of David Hoskinson
> Sent: 29 June 2009 16:12
> To: General Linux-HA mailing list
> Subject: [Linux-HA] Failover problems
> 
> I must be missing something here I hope someone can help.  I have a
> master/slave setup using latest openais/pacemaker/drbd.  System starts
> up perfectly and if I shutdown slave, primary notices status change
> and also notices when slave reconnects.  If I shutdown master, drbd
> and services transfer to slave and all works well.
> 
> The problem as I see it, is that when the master comes back on line it
> reassumes the drbd and services however I am left with a split brain
> for the drbd.  I get split brain messages in logs, and primary machine
> shows primary/unknown in the cat/proc/drbd.  And Slave shows
> slave/unknown.  I am able to manually reconnect the drives as been
> suggested earlier but this doesn't seem to be the "normal" way in my
> way of thinking or am I wrong with this.  Should it be split brain
> when master takes back over?  I want to know if I am struggling over
> something I shouldn't be.  It just seems to me that it should
> seamlessly reconnect without enabling the "automatic" split brain
> function in drbd.
> 
> Hope this makes sense to someone...
> 
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems