I see a lot of people on this list talking about Linux-HA, DRBD, and OCFS2, so I thought I would share some of my experiences with them.
We took two Dell 860 servers out of the box and set up FC5. Because we are using software striping we ended up creating LVM partitions, and we decided to leave a huge slice of the disk unallocated/unformatted for /opt and run DRBD on it. The software running inside /opt can only exist on one node at a time, but I decided to run DRBD with two primaries anyway. This way I could look at the data on both nodes, and it seemed really cool.

DRBD and OCFS2 did not work correctly on the base FC5 kernel, so I ran 'yum update' and got the newest one. That is the first sticky point. Yes, FC5 is a development line, but FC5 is also IMHO just as stable as any other OS. Let's agree that it can't be 'much less stable', if that makes any sense. I grabbed the latest DRBD. Finding an OCFS2 RPM for FC5 took some searching, but I found one.

I got the system set up in a testing environment. I created a file on one node, then edited it on the other, so the clustered filesystem worked. Failover worked. All was well, and I was surprised at how well things were going.

Then I tried to bring the filesystem under Heartbeat's control. At the time Heartbeat did not have a resource agent (RA) for DRBD with two primaries, so I settled for making cloned resources out of the /etc/init.d scripts. I think this might be a bad idea in an active/active setup; you might be better off leaving your storage unmanaged. Because of the way I set this up, stopping Heartbeat would stop the cluster disk. This freaked out users who may have just been looking at a file on the passive node while Heartbeat was restarting. I have also found that adding resources into and out of groups, or changing their order, can cause resources in HA to restart unexpectedly. I felt like my setup was a minefield: a bug in one init script might cause a cascade, and if something in a colocation died it was going to cause a cluster-wide failover. It was hard for me to enforce the business rules that I really wanted to. Restarting something in a colocation caused a cascade.
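For anyone curious, the dual-primary part comes down to a few lines in drbd.conf. This is only a sketch, assuming DRBD 8.x; the resource name, hostnames, IP addresses and LVM device below are made up for illustration, not my actual config:

```
resource r0 {
  protocol C;                       # synchronous replication; dual-primary needs this
  net {
    allow-two-primaries;            # let both nodes be Primary at the same time
  }
  startup {
    become-primary-on both;         # promote both nodes when DRBD starts
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/VolGroup00/opt;  # the unallocated LVM slice set aside for /opt
    address   192.168.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/VolGroup00/opt;
    address   192.168.0.2:7788;
    meta-disk internal;
  }
}
```

With OCFS2 mounted on /dev/drbd0 on both nodes, you get the create-on-one-node, edit-on-the-other behaviour I described.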
This may be more or less due to a lack of Heartbeat knowledge on my side. Everything was grand in the back room. BUT our production configuration opens more files for logging, and our software opens a lot of files in general. OCFS2 and DRBD added a noticeable lag to starting our application. It could be the time to acquire file locks, etc.; whatever the case, it was slower. This had a bad effect on the timeouts in my status and monitor functions. Life lesson: timeouts need to be set higher. It still caused headaches, because I had already handed the system off to the next implementer, and they could not understand why configuration changes were causing cluster-wide failovers from node1 to node2 to node1 to node2... repeatedly kicking them out of ssh as the VIP address kept failing back and forth.

My final straws were: a system that went into such high I/O wait that it needed a reboot; corrupt files that could not be deleted; and being unable to reboot a machine after OCFS2 and DRBD failed. That last one was really annoying, and it might have been more of an OCFS2 problem than a DRBD one. Regardless, I could not unmount the DRBD disk; stopping OCFS2 and o2cb, unloading the modules, etc. would not do it. Killing processes did not handle it, and neither did modprobe. We had to call out the datacenter, and they rebooted the wrong node.

I finally gave up on the disk cluster. We have a plan to move it all back to a single server. I shut off one end of the cluster and it ran for about a week. The system crashed today, and I got this message: 'OCFS timeout accessing DRBD. OCFS2 is sorry to be fencing the system.' Not as sorry as I am.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
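To make the timeout lesson concrete: in the Heartbeat v2 CIB, each operation carries its own timeout, and the defaults were too tight for how slowly things start on OCFS2 over DRBD. A sketch of the kind of change I mean (the ids, resource name, and values are illustrative, not my real config):

```
<primitive id="opt_app" class="lsb" type="ourapp">
  <operations>
    <!-- start can take a long time while file locks are acquired -->
    <op id="opt_app_start" name="start" timeout="180s"/>
    <!-- monitor timeout must exceed the worst-case check time -->
    <op id="opt_app_mon" name="monitor" interval="30s" timeout="120s"/>
  </operations>
</primitive>
```

If the monitor timeout is shorter than your worst-case lock-acquisition time, the CRM decides the resource has failed, and you get exactly the node1/node2 ping-pong I described.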
