I see a lot of people on this list talking about Linux-HA, DRBD, and OCFS2, so I thought I would share some of my experiences with them.
We took two Dell 860 servers out of the box and set up FC5. Because we are using software striping we ended up creating LVM partitions, and we decided to leave a huge slice of the disk unallocated/unformatted for /opt and run DRBD on it. The software running inside /opt can only exist on one node at a time, but I decided to run DRBD with two primaries anyway. This way I could look at the data on both nodes, and it seemed really cool.

DRBD and OCFS2 did not work correctly on the base FC5 kernel, so I ran 'yum update' and got the newest one. That is the first sticky point. Yes, FC5 is a development line, but FC5 is also IMHO just as stable as any other OS. Let's agree that it can't be 'much less stable', if that makes any sense. I grabbed the latest DRBD. Finding an OCFS2 RPM for FC5 took some searching, but I found one.

I got the system set up in a testing environment. I created a file on one node, then edited it on the other, so the clustered filesystem worked. Failover worked. All was well, and I was surprised at how well things were going.

Then I tried to bring the filesystem under Heartbeat's control. At the time Heartbeat did not have a resource agent (RA) for DRBD with two primaries, so I settled for making cloned resources out of the /etc/init.d scripts. I think this might be a bad idea in an active/active setup; you might be better off leaving your storage unmanaged. Because of the way I set this up, stopping Heartbeat would stop the cluster disk. This freaked out users who may have just been looking at a file on the passive node while Heartbeat was restarting. I have also found that adding resources into and out of groups, or changing their order, can cause resources in HA to restart unexpectedly. I felt like my setup was a minefield: a bug in one init script might cause a cascade, and if something in a colocation died it was going to cause a cluster-wide failover. It was hard for me to enforce the business rules that I really wanted to. Restarting something in a colocation caused a cascade.
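For anyone curious, the dual-primary part comes down to a few lines in drbd.conf. This is only a sketch, assuming DRBD 8.x; the resource name, hostnames, IP addresses and LVM device below are made up for illustration, not my actual config:

```
resource r0 {
  protocol C;                       # synchronous replication; dual-primary needs this
  net {
    allow-two-primaries;            # let both nodes be Primary at the same time
  }
  startup {
    become-primary-on both;         # promote both nodes when DRBD starts
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/VolGroup00/opt;  # the unallocated LVM slice set aside for /opt
    address   192.168.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/VolGroup00/opt;
    address   192.168.0.2:7788;
    meta-disk internal;
  }
}
```

With OCFS2 mounted on /dev/drbd0 on both nodes, you get the create-on-one-node, edit-on-the-other behaviour I described.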
This may be more or less due to a lack of Heartbeat knowledge on my side. Everything was grand in the back room. BUT our production configuration opens more files for logging, and our software opens a lot of files in general. OCFS2 and DRBD added a noticeable lag to starting our application. It could be the time to acquire file locks, etc.; whatever the case, it was slower. This had a bad effect on the timeouts in my status and monitor functions. Life lesson: timeouts need to be set higher. It still caused headaches, because I had already handed the system off to the next implementer, and they could not understand why configuration changes were causing cluster-wide failovers from node1 to node2 to node1 to node2... repeatedly kicking them out of ssh as the VIP address kept failing back and forth.

My final straws were: a system that went into such high I/O wait that it needed a reboot; corrupt files that could not be deleted; and being unable to reboot a machine after OCFS2 and DRBD failed. That last one was really annoying, and it might have been more of an OCFS2 problem than a DRBD one. Regardless, I could not unmount the DRBD disk; stopping OCFS2 and o2cb, unloading the modules, etc. would not do it. Killing processes did not handle it, and neither did modprobe. We had to call out the datacenter, and they rebooted the wrong node.

I finally gave up on the disk cluster. We have a plan to move it all back to a single server. I shut off one end of the cluster and it ran for about a week. The system crashed today, and I got this message: 'OCFS timeout accessing DRBD. OCFS2 is sorry to be fencing the system.' Not as sorry as I am.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
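To make the timeout lesson concrete: in the Heartbeat v2 CIB, each operation carries its own timeout, and the defaults were too tight for how slowly things start on OCFS2 over DRBD. A sketch of the kind of change I mean (the ids, resource name, and values are illustrative, not my real config):

```
<primitive id="opt_app" class="lsb" type="ourapp">
  <operations>
    <!-- start can take a long time while file locks are acquired -->
    <op id="opt_app_start" name="start" timeout="180s"/>
    <!-- monitor timeout must exceed the worst-case check time -->
    <op id="opt_app_mon" name="monitor" interval="30s" timeout="120s"/>
  </operations>
</primitive>
```

If the monitor timeout is shorter than your worst-case lock-acquisition time, the CRM decides the resource has failed, and you get exactly the node1/node2 ping-pong I described.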
