Hi All, I have successfully set up Heartbeat / Xen / LVM / DRBD / LVM to run three Xen domains on each node of a pair of machines, so that each group of domains can fail over to the sister machine as a unit. The nodes are named a1xen and b1xen.
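For reference, in a Heartbeat v1 setup like this the whole group is usually declared as one ordered resource line in /etc/ha.d/haresources. A rough sketch only — the script names and arguments here are guessed from the entries in the failover log later in this message, not copied from my actual config:

```
# /etc/ha.d/haresources -- one ordered resource group, homed on a1xen
# (sketch; resource script names/arguments assumed from the log entries)
a1xen drbddisk::vga lvm::VolGroupA xen::a1my1 xen::a1my2 xen::a1asp
```

Heartbeat starts these left to right on takeover and stops them right to left on release, which matches the ordering visible in the log.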
Each group of three Xen domains consists of two master-master replicating MySQL servers and an application server running point-of-sale software that sales clerks log into. I have instrumented my resource scripts to show what happens when one node fails. This is the log on b1xen:

    Thu Dec 13 01:18:44 AST 2007 stonith reset a1xen
    Thu Dec 13 01:18:48 AST 2007 drbddisk vga start
    Thu Dec 13 01:18:48 AST 2007 lvm start VolGroupA
    Thu Dec 13 01:18:51 AST 2007 xen start a1my1
    Thu Dec 13 01:18:51 AST 2007 xen start a1my2
    Thu Dec 13 01:18:51 AST 2007 xen start a1asp
    Thu Dec 13 01:20:02 AST 2007 xen stop a1my1
    Thu Dec 13 01:20:03 AST 2007 xen stop a1my2
    Thu Dec 13 01:20:03 AST 2007 xen stop a1asp
    Thu Dec 13 01:20:38 AST 2007 lvm stop VolGroupA
    Thu Dec 13 01:20:39 AST 2007 drbddisk vga stop

Node a1xen really did fail: I have flaky hardware to test with for this purpose. Node b1xen correctly fenced a1xen and took over its services. After a1xen rebooted, it correctly migrated the services back.

Here is the problem: taking over the services right away like this doesn't achieve anything except bouncing the MySQL servers and irritating the users, who log in only to be dumped again a minute later. What I am looking for is a way to tell the surviving node to reset the sick node and then wait a while to see whether it comes back before taking over its services. Any ideas?

Thanks,
John

John Gorman
Master Merchant Systems

P.S. I wrote a nice external/ippower9258 stonith script to support the IPPower network power controller family. Is there somewhere I should submit it so that other people can use it?

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
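On the wait-before-takeover idea above: one way to sketch that behaviour is a small helper in the instrumented resource script, run right after the stonith reset, that polls the fenced node for a grace period before the usual start sequence. This is only a sketch — wait_for_peer, PEER_CHECK, and the timings are names invented for the example, not Heartbeat facilities:

```shell
# Sketch only: wait_for_peer and PEER_CHECK are invented for this example,
# not Heartbeat facilities. After the "stonith reset", the surviving node
# polls the sick node for a grace period before starting its services.

PEER_CHECK=${PEER_CHECK:-"ping -c 1 -w 2"}   # how to probe the peer (overridable)

# wait_for_peer <node> <grace-seconds> [poll-interval]
# Returns 0 if the peer answered within the grace period, 1 if it stayed down.
wait_for_peer() {
    node=$1 grace=$2 step=${3:-5}
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        if $PEER_CHECK "$node" >/dev/null 2>&1; then
            echo "peer $node back after ${waited}s; leaving its services alone"
            return 0
        fi
        sleep "$step"
        waited=$((waited + step))
    done
    echo "peer $node still down after ${grace}s; taking over"
    return 1
}

# In the resource script, right after the stonith reset:
#   if wait_for_peer a1xen 180; then exit 0; fi
#   ... otherwise continue with the usual drbddisk / lvm / xen start sequence ...
```

A ping answer only proves the kernel is up, of course; probing the MySQL port instead would be a stricter test that the services are really coming back.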

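Regarding the P.S.: Heartbeat's external stonith plugins are plain scripts invoked with a subcommand as their first argument, with configuration values such as $hostlist arriving as environment variables, so the ippower9258 script presumably has roughly this shape. The power-control lines below are placeholders, not the real IPPower protocol:

```shell
# Skeleton of an "external" stonith plugin (sketch; the on/off/reset
# bodies are placeholders, not the real IPPower 9258 commands).
stonith_op() {
    case "$1" in
        gethosts)        echo $hostlist ;;           # nodes this device controls
        on|off|reset)    echo "would $1 host $2" ;;  # placeholder power command
        status)          return 0 ;;                 # is the device reachable?
        getconfignames)  echo "hostlist" ;;          # env vars the plugin expects
        getinfo-devname) echo "IPPower 9258 power switch" ;;
        *)               return 1 ;;
    esac
}
```

Plugins following this interface can simply be dropped into the stonith external/ directory, so submitting the script to the Linux-HA project for inclusion there seems like the natural route.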