On Tue, Oct 05, 2010 at 11:47:37AM +0100, Steve Davies wrote:
> Yes, the haresources model is simple. I have encountered the issue
> above, and other similar issues.
>
> Right now I have 3 situations that I have discovered, and plan to work
> around them (comments from more experienced HA users are welcome):
>
> Fail 1) A node needs to go active, but this fails. This causes an
> attempt to go back to slave. RM does not record that it is not-active
> unless it can speak to the other node.
>
> Solution 1) Really? I just stopped everything... Of course I should
> no-longer be active! I plan to have the RM record that I am inactive
> even after the failed ha_standby request, or perhaps beforehand (I'll
> add a timeout I guess) This will have knock-on effects, which will
> need chasing down :)
>
>
> Fail 2) Split-brain. This restarts both nodes 'heartbeat' daemons, and
> will kill a perfectly working node.
>
> Solution 2) An understandable solution, but sometimes it can be more
> clever. I hope to add a F_SPLITBRAIN message that includes a SETWEIGHT
> - This will then run an rc script on each node, and allow the 2 nodes
> to fight it out. If that fails, then we'll do the restart. The script
> in its simplest form can of course just do a heartbeat daemon restart
> :)
>
>
> Fail 3) If 2 nodes get split, but also get out-of-sync. Split brain is
> not recognised, and when reconnected, an "Active" message is
> exchanged/logged, but ignored.
>
> Solution 3) The "Active" message already causes a 'status' script to
> run. I plan to extend this script to cause a Splitbrain alert when
> appropriate to cause the same resolution as in 2) above.
>
>
> Note, all of the above are theoretical solutions, and I do not know
> when I might get round to improving them, I just thought it might be
> useful to publish my findings so far given that they seem to relate to
> this thread.

If you can create test cases for any of these,
maybe even in a form that the "CTS" understands,
that would probably help a lot.

> The "old" resource manager is beautifully lightweight, and does not
> /require/ hundreds of megabytes of Python and XML libraries to
> operate. I am working on keeping it lightweight so it can be used in
> small systems. Wish me luck :)

I certainly do.

> Regards,
> Steve

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
