On Thu, Aug 11, 2011 at 8:12 PM, Digimer <li...@alteeve.com> wrote:
> On 08/11/2011 12:58 PM, Alex Forster wrote:
>> I have a two node Pacemaker/Corosync cluster with no resources configured
>> yet. I'm running RHEL 6.1 with the official 1.1.5-5.el6 package.
>>
>> While doing various network configuration, I happened to notice that if I
>> issue a "service network restart" on one node, then approx. four seconds
>> later issue "service network restart" on the second node, the two nodes
>> become split brain, each thinking the other is offline.
>>
>> Obviously, issuing 'service network restarts' four seconds apart will not
>> be a common occurrence in production, but it concerns me that I can 'trick'
>> the nodes into becoming split-brain so easily. Is there some way I can
>> configure Corosync to quickly recover from this scenario?

man corosync.conf

You can increase the value of rrp_problem_count_timeout for this.

       rrp_problem_count_timeout
              This specifies the time in milliseconds to wait before
              decrementing the problem count by 1 for a particular ring to
              ensure a link is not marked faulty for transient network
              failures.

              The default is 2000 milliseconds.

This, however, can cause issues further down the line, so take into
account the timeouts your resources have, as well as their monitor
operations, and make sure they include the added time from modifying
this value.

Regards,
Dan

p.s.: don't mess with rrp_problem_count_threshold unless you also
consider that

(rrp_problem_count_threshold * rrp_token_expired_timeout) < (token - 50ms)
=> (10 * 47) < (1000 - 50)
=> 470 < 950

(these are the defaults; changing rrp_problem_count_threshold to a higher
value would also mean changing the token timeout and/or other parameters,
so it would be best to plan ahead).

>>
>> Alex
>
> Configuring fence (stonith) will protect against split-brain by causing
> the remote node to be forced offline (rough, but better than split-brain).
>
> --
> Digimer
> E-Mail: digi...@alteeve.com
> Freenode handle: digimer
> Papers and Projects: http://alteeve.com
> Node Assassin: http://nodeassassin.org
> "At what point did we forget that the Space Shuttle was, essentially,
> a program that strapped human beings to an explosion and tried to stab
> through the sky with fire and math?"
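
For what it's worth, the relation in the p.s. above can be checked before touching any of the three values. This is only a minimal sketch in Python, not part of corosync or its tooling: the function name and constant are made up for illustration, all values are in milliseconds, and the 50 ms margin is simply the one quoted in the p.s.

```python
# Sketch: check (rrp_problem_count_threshold * rrp_token_expired_timeout)
#         < (token - 50ms), as quoted in the p.s.
# The names below are illustrative only; nothing here is a corosync API.

MARGIN_MS = 50  # the "50ms" margin from the p.s.


def rrp_threshold_is_safe(problem_count_threshold: int,
                          token_expired_timeout_ms: int,
                          token_ms: int) -> bool:
    """Return True if threshold * expired_timeout < token - margin."""
    left = problem_count_threshold * token_expired_timeout_ms
    right = token_ms - MARGIN_MS
    return left < right


if __name__ == "__main__":
    # Defaults quoted in the message: 10 * 47 = 470, and 470 < 950 holds.
    print(rrp_threshold_is_safe(10, 47, 1000))
    # Hypothetical change: threshold raised to 25 with token left at 1000.
    # 25 * 47 = 1175, which is not below 950, so the relation fails.
    print(rrp_threshold_is_safe(25, 47, 1000))
```

With the defaults the check passes; raising only the threshold makes it fail, which is why the token timeout has to be planned together with it.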

--
Dan Frincu
CCNA, RHCE

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker