Do not worry, I do not need HA support with rules/CIBs etc. Regardless
of what version of HA I use, DRBD and OCFS2 are where my issues lie at
this point.

That is why I called my post "did not work out so well" instead of
giving a good technical description of my problem. The main theme of
the post is that the setup is complex and that I am having a lot of
multi-layer issues.

I got started down this track because, while reading about DRBD, I
found a PDF that lists what DRBD 8 supports, and Active/Active with
OCFS2/GFS is checked. I just wanted to share my experiences with the
entire process.

While I think all the pieces are fundamentally sound, troubleshooting,
configuring, and managing DRBD, OCFS2, and Heartbeat has been quite an
adventure. We all know that setup is only half the battle.

The real question is whether all this setup and management constitutes
an effective strategy. Based on my experimenting I would have to say
no. I think the loose coupling of DRBD, OCFS2, and HA is not very
effective.

The system takes far longer to set up. We normally build all our
systems with kickstart, but the DRBD, HA, and OCFS2 parts are a manual
process that has to be done after install.

Then we have to integrate all these parts with Heartbeat. It took a
fair amount of experimenting before I arrived at the approach most
people use (cloned resources wrapping the init.d scripts).
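In case it saves someone else the experimenting: the cloned-init.d-script approach looks roughly like this in the Heartbeat 2 CIB. The ids and the o2cb LSB script name are my assumptions, so check what your distribution ships:

```xml
<!-- Run the OCFS2 cluster service on both nodes, one copy per node -->
<clone id="o2cb-clone">
  <instance_attributes id="o2cb-clone-ia">
    <attributes>
      <nvpair id="o2cb-clone-max" name="clone_max" value="2"/>
      <nvpair id="o2cb-clone-node-max" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <!-- class="lsb" wraps the stock /etc/init.d/o2cb script -->
  <primitive id="o2cb" class="lsb" type="o2cb"/>
</clone>
```

The same pattern applies to the DRBD init script, with an order constraint to start it before the filesystem layer.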

After all the setup, the system turned out to be very unstable: nodes
that will not reboot, high I/O wait, and DRBD and OCFS2 kernel modules
refusing to load and unload. Again, this is more of an OCFS2/DRBD
problem than an HA problem.

Uptime is important, but time spent on management and setup is just as
important, if not more so. The fact is, with a decent backup, a cold
spare system, and a kickstart file, I could probably bring this system
back in half an hour. Yet I have spent days, if not weeks, of work
troubleshooting DRBD, OCFS2, and Heartbeat. I have had to fail over
and back a number of times, and also call the data center for reboots.



On 8/8/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 8/8/07, Eddie C <[EMAIL PROTECTED]> wrote:
> > A shared firewire disk and a shared storage array are both options, but then
> > you have a single point of failure. A good disk array is theoretically a very
> > resilient single point of failure, but it is still an SPOF.
> >
> > The great thing about DRBD active/active and OCFS2 is that you eliminate the
> > SPOF.
> >
> > Also, the other options (SCSI locking, FireWire) are all tied to specific
> > hardware/SANs. This solution would be very generic.
> >
> > Point taken most of my problems are DRBD, OCFS2 specific problems.
> >
> > I understand the point that leaving the resource unmanaged is not a good
> > thing. I know a fair amount about Heartbeat colocations, orders, and places.
> > Here is a more technical description of what I was trying to do,
> > Heartbeat-wise.
> >
> > Resource 1 VIP (used IPaddr2)
> > Resource 2 IP route (created an RA) for this
> > Resource 3 Web Scraping utility (used init script)
> > Resource 4 Process to work with web scraping and usenet data
> > Resource 5 Usenet Scraping utility
> > Resource 6 OCFS2 (cloned)
> > Resource 7 DRBD (cloned)
> >
> > This was my first design
> > Order1 - Start 7 before 6
> > Group1 - Resources 1 and 2, processes 3, 4, 5
> >
> > This worked well, but since everything was grouped, a failed resource in
> > Group1 caused everything to fail and possibly restart/move. Anyone connected
> > lost their connection as the VIP left and came back a few seconds later. This
> > scenario was deemed unacceptable.
> >
> > So then I tried writing a bunch of colocation rules:
> > Colocate 4 and 5
> > Colocate 3 and 4
> > Colocate Group1 and 4
> > That had the same effect as grouping, though: if an item failed, it caused
> > the colocation to fail, which took down all the other colocations.
> >
> > What I really needed was a way to say: this resource should run wherever the
> > VIP is running, and the VIP should only run on a node with the shared disk
> > running.
> > PLACE seems only to be able to tell a resource to run on a specific node.
> >
> > So I tried that implementation
> >
> > Resource 1 VIP --PLACE node1 100
> > Resource 2 IP route --PLACE node1 100
> > Resource 3 Web Scraping utility --PLACE node1 100
> > Resource 4 Process to work with web scraping and usenet data --PLACE node1 100
> > Resource 5 Usenet Scraping utility --PLACE node1 100
> >
> > This worked well because now everything is loosely coupled and can still
> > fail over, but failing over the VIP and route does not fail over resources
> > 3, 4, and 5.
> >
> > So neither place nor colocation can really express: I need this resource to
> > run only where another resource is running, but if this resource cannot
> > start, do not fail the parent; if the parent does fail or move, the resource
> > should re-evaluate and move with it. A one-way dependency.
>
> finally some clue as to what version you're running!
>
> please update, we've been able to do one-way colocation since 2.0.8
>
> people really do make life hard on themselves when they don't provide
> the relevant information to the people they want help from
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
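For anyone finding this thread later: the one-way colocation Andrew mentions is an asymmetric rsc_colocation constraint in the CIB. A sketch of what it might look like in the 2.0.8-era syntax follows; the resource ids are invented, and the attribute names changed between releases, so verify against your version's documentation:

```xml
<!-- "vip" is placed on whatever node "o2cb-clone" is running on,
     but a failure of "vip" does not move or fail o2cb-clone:
     the dependency runs one way only -->
<rsc_colocation id="vip-with-disk" from="vip" to="o2cb-clone"
                score="INFINITY"/>
```

This is exactly the "run this resource only where that one is, without the reverse pull" behavior the grouped and mutually-colocated designs above could not express.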
