On Thu, Aug 11, 2011 at 8:12 PM, Digimer <li...@alteeve.com> wrote:
> On 08/11/2011 12:58 PM, Alex Forster wrote:
>> I have a two node Pacemaker/Corosync cluster with no resources configured yet.
>> I'm running RHEL 6.1 with the official 1.1.5-5.el6 package.
>>
>> While doing various network configuration, I happened to notice that if I issue
>> a "service network restart" on one node, then approx. four seconds later issue
>> "service network restart" on the second node, the two nodes become split brain,
>> each thinking the other is offline.
>>
>> Obviously, issuing 'service network restarts' four seconds apart will not be a
>> common occurrence in production, but it concerns me that I can 'trick' the nodes
>> into becoming split-brain so easily. Is there some way I can configure Corosync
>> to quickly recover from this scenario?

See man corosync.conf.
You can increase the value of rrp_problem_count_timeout for this.

rrp_problem_count_timeout
              This specifies the time in milliseconds to wait before
              decrementing the problem count by 1 for a particular ring
              to ensure a link is not marked faulty for transient
              network failures.

              The default is 2000 milliseconds.

Changing it, however, can cause issues further down the line: you need
to factor the added time into the timeouts of your resources and of
their monitor operations.
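
As a rough sketch of where the setting goes (assuming the usual layout in
which it sits in the totem section of /etc/corosync/corosync.conf; the
value 4000 is only an illustrative number, not a recommendation, and the
rest of your totem section is left out here):

```
totem {
        version: 2
        # ... your existing totem settings (interfaces, rrp_mode, etc.) ...

        # Default is 2000 ms. 4000 is an arbitrary example value.
        rrp_problem_count_timeout: 4000
}
```

The file would need the same change on both nodes, and corosync has to be
restarted to pick it up.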

Regards,
Dan

p.s.: don't change rrp_problem_count_threshold without also checking
that (rrp_problem_count_threshold * rrp_token_expired_timeout) <
(token - 50ms). With the defaults that is (10 * 47) < (1000 - 50),
i.e. 470 < 950. Raising rrp_problem_count_threshold may therefore also
mean changing the token timeout and/or other parameters, so it would be
best to plan ahead.
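
To make that arithmetic concrete, here is a hedged example with the
defaults quoted above (token 1000, rrp_token_expired_timeout 47). The
threshold of 20 is a made-up value chosen only to show the check:

```
totem {
        version: 2
        # Defaults: 10 * 47 = 470, which is below 1000 - 50 = 950.
        # Example:  20 * 47 = 940, still below 950, so token can stay at 1000.
        # A threshold of 21 would give 987, which is not below 950; token
        # would then have to be raised above 1037, or
        # rrp_token_expired_timeout lowered.
        token: 1000
        rrp_token_expired_timeout: 47
        rrp_problem_count_threshold: 20
}
```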

>>
>> Alex
>
> Configuring fencing (stonith) will protect against split-brain by forcing
> the remote node offline (rough, but better than split-brain).
>
> --
> Digimer
> E-Mail:              digi...@alteeve.com
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "At what point did we forget that the Space Shuttle was, essentially,
> a program that strapped human beings to an explosion and tried to stab
> through the sky with fire and math?"
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>



-- 
Dan Frincu
CCNA, RHCE

