Steven Dake wrote:
> On 12/08/2010 08:36 AM, Ryan Steele wrote:
>> Hey,
>>
>> Just noticed a problem with Corosync 1.2.0-0ubuntu1, using a two-node 
>> cluster configured with redundant rings, which I
>> found when testing my STONITH devices.  When an interface on either one of 
>> the rings fails and is marked faulty, the
>> interface for that same ring on the other node is also marked faulty 
>> immediately.  This means that if any interface
>> fails, the entire associated ring fails.  I mentioned it in IRC, and it was 
>> believed to be a bug.  Here is my corosync.conf:
>>
> 
> That is how it is supposed to work.  Any interface that is faulty within
> one ring will mark the entire ring faulty.  To re-enable the ring, run
> corosync-cfgtool -r (once the faulty network condition has been repaired).
> 


Even if every other interface on that ring is working fine?  Why isn't the node 
with the faulty interface segregated, so
the rest can continue to converse on that otherwise healthy ring?  That would 
dramatically increase the resiliency of
the rings, and it is much easier to scale with nodes than with interfaces, 
especially with density being a big
trend in datacenters.  I can fit more twin-nodes in my racks than I can fit 
interfaces on half a chassis.
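For anyone else who hits this, the recovery sequence once the underlying link is 
healthy again looks roughly like the following (a sketch assuming the stock 
corosync-cfgtool from the corosync package; exact output varies by version):

```shell
# Show current ring status for this node; a failed ring is reported
# as FAULTY next to its interface address.
corosync-cfgtool -s

# After the network fault has actually been repaired, clear the FAULTY
# flag and re-enable redundant ring operation.
corosync-cfgtool -r

# Confirm that both rings show as active again.
corosync-cfgtool -s
```

Note that running -r before the link is truly healthy will just get the ring 
marked faulty again on the next failure detection.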


> Regards
> -steve
> 
>> ######### begin corosync.conf
>> compatibility: whitetank
>>
>> totem {
>>    version: 2
>>    secauth: off
>>    threads: 0
>>    rrp_mode: passive
>>    consensus: 1201
>>
>>    interface {
>>       ringnumber: 0
>>       bindnetaddr: 192.168.192.0
>>       mcastaddr: 227.94.1.1
>>       mcastport: 5405
>>    }
>>
>>    interface {
>>       ringnumber: 1
>>       bindnetaddr: 10.1.0.0
>>       mcastaddr: 227.94.1.2
>>       mcastport: 5405
>>    }
>> }
>>
>> logging {
>>    fileline: off
>>    to_stderr: yes
>>    to_syslog: yes
>>    syslog_facility: daemon
>>    debug: off
>>    timestamp: on
>>    logger_subsys {
>>       subsys: AMF
>>       debug: off
>>    }
>> }
>>
>> aisexec {
>>    user:  root
>>    group: root
>> }
>>
>> service {
>>    name: pacemaker
>>    ver:  0
>> }
>> ######### end corosync.conf
>>
>> Please let me know if you need anything else to help diagnose this problem.  
>> Also, I found a typo in the error message
>> that appears in the logs ("adminisrtative" instead of "administrative"):
>>
>> corosync[3419]:   [TOTEM ] Marking seqid 66284 ringid 1 interface 10.1.1.168 
>> FAULTY - adminisrtative intervention required.
>>
>> A "corosync-cfgtool -r" fixes the issue once the link is healthy again, but 
>> it's definitely not optimal to have one
>> interface failure bring down the entire ring.  Again, let me know if there's 
>> anything else I can do to assist.  Thanks,
>> and keep up the hard work!
>>
>>
>> -Ryan
>> _______________________________________________
>> Openais mailing list
>> [email protected]
>> https://lists.linux-foundation.org/mailman/listinfo/openais
> 

-- 
Ryan Steele                                    [email protected]
Systems Administrator                          +1 215-825-2196 x758
AWeber Communications                          http://www.aweber.com