A shared FireWire disk and a shared storage array are both options, but then you have a single point of failure. A good disk array is theoretically very resilient, but a resilient SPOF is still an SPOF.
The great thing about DRBD active/active with OCFS2 is that you eliminate the SPOF. The other options (SCSI locking, FireWire) are all tied to specific hardware or SANs; this solution would be very generic. Point taken that most of my problems are DRBD- and OCFS2-specific. I understand the point that leaving a resource unmanaged is not a good thing, and I know a fair amount about Heartbeat colocations, orders, and places. Here is a more technical description of what I was trying to do, Heartbeat-wise:

Resource 1 - VIP (used IPaddr2)
Resource 2 - IP route (wrote an RA for this)
Resource 3 - Web scraping utility (used an init script)
Resource 4 - Process that works with the web scraping and Usenet data
Resource 5 - Usenet scraping utility
Resource 6 - OCFS2 (cloned)
Resource 7 - DRBD (cloned)

My first design was:

Order1 - start 7 before 6
Group1 - Resources 1 and 2, processes 3, 4, 5

This worked well, but since everything was grouped, a failed resource in Group1 caused everything to fail and possibly restart or move. Anyone connected lost their connection as the VIP left and came back a few seconds later. That scenario was deemed unacceptable.

So then I tried writing a set of colocation rules instead:

Colocate 4 and 5
Colocate 3 and 4
Colocate Group1 and 4

That had the same effect as grouping, though: when an item failed, its colocation failed, which took down all the other colocations. What I really needed was a way to say: this resource must run wherever the VIP is running, and the VIP should only run on a node where the shared disk is running. PLACE only seems to be able to pin a resource to a node.
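For the archives, here is roughly how I would express those rules as raw CIB XML. This is a sketch based on my reading of the Heartbeat 2.x constraint schema; the resource ids are made up, and the attribute names are from memory, so double-check them against the DTD. The interesting part is the score: INFINITY makes a colocation mandatory, while a finite score is only advisory, which may be closer to the one-way dependency I was after.

```xml
<!-- Sketch only: ids are hypothetical, attribute names approximate -->
<constraints>
  <!-- Start DRBD (resource 7) before OCFS2 (resource 6) -->
  <rsc_order id="ocfs2_after_drbd" from="resource6_ocfs2"
             type="after" to="resource7_drbd"/>

  <!-- The VIP may only run on a node where the cloned OCFS2 mount is up -->
  <rsc_colocation id="vip_with_disk" from="resource1_vip"
                  to="resource6_ocfs2" score="INFINITY"/>

  <!-- The scraper prefers, but is not forced, to follow the VIP:
       a finite score is advisory, so a failed scraper should not
       drag the VIP down with it -->
  <rsc_colocation id="scraper_near_vip" from="resource3_scraper"
                  to="resource1_vip" score="100"/>
</constraints>
```

If the advisory score behaves as documented, this would let resources 3, 4, and 5 follow the VIP without a failure in them propagating back up to it.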
So I tried that implementation:

Resource 1 - VIP --PLACE node1 100
Resource 2 - IP route --PLACE node1 100
Resource 3 - Web scraping utility --PLACE node1 100
Resource 4 - Web/Usenet data process --PLACE node1 100
Resource 5 - Usenet scraping utility --PLACE node1 100

This worked well because everything is now loosely coupled and can still fail over, but failing over the VIP and route does not fail over resources 3, 4, and 5. So neither place nor colocation can really express what I need: this resource must run only where another resource is running, but if this resource cannot start, don't fail the parent; if the parent does fail, re-evaluate and move with it. A one-way dependency.

On 8/8/07, Robert Wipfel <[EMAIL PROTECTED]> wrote:
>
> >>> On Wed, Aug 8, 2007 at 1:09 AM, in message
> <[EMAIL PROTECTED]>, "Andrew Beekhof"
> <[EMAIL PROTECTED]> wrote:
> > On 8/8/07, Eddie C <[EMAIL PROTECTED]> wrote:
> > [...]
> >> My post was more of a rant than anything else. I was looking for 50 people
> >> or so to read my post and say: "You must be doing something wrong. I run
> >> DRBD with OCFS2 multi-node active/active and MySQL, and it's super fast
> >> and never crashes on two 100 MHz laptops."
>
> I know it's a bit off topic, but for this kind of setup (super low cost
> nodes), a shared FireWire disk actually seems to work quite well. You'll
> have to load the firewire module with exclusive_login = 0, and then
> both servers will happily share that $150 FW disk ;-) and it actually works
> well enough to run Xen VMs off the shared disk, using OCFS2 in userspace
> heartbeat mode, with Heartbeat2 managing the VMs as resources. There's
> some setup info here:
> http://www.oracle.com/technology/pub/articles/hunter_rac10gr2.html
>
> > Given how little you seem to be using heartbeat, perhaps the drbd or
> > ocfs2 lists might be a more appropriate forum.
>
> >> The concept is great: an active/active disk partition and heartbeat with
> >> no fancy SANs. It works for stretches of 3 or 4 weeks, but then I run
> >> into a weird locked directory that I can't delete, or a file owned by
> >> '?'. Or the partition unmounts and the system will not reboot.
>
> I haven't checked prices recently, but multi-initiator serial attached
> shared SCSI is also a recent option, with a number of vendors providing
> low cost RAID enclosures that can be shared across more than the
> shared FireWire disk limit of two nodes. E.g. Tom's Hardware did a good
> review: http://www.tomshardware.com/2006/04/07/going_the_sas_storage_way/
>
> Hth,
> Robert
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
