Hi,

Ryan Steele wrote:
Steven Dake wrote:
On 12/08/2010 08:36 AM, Ryan Steele wrote:
Hey,

Just noticed a problem with Corosync 1.2.0-0ubuntu1, using a two-node cluster configured with redundant rings, that I found when testing my STONITH devices.  When an interface on either one of the rings fails and is marked faulty, the interface for that same ring on the other node is also marked faulty immediately.  This means that if any interface fails, the entire associated ring fails.  I mentioned it in IRC, and it was believed to be a bug.  Here is my corosync.conf:

That is how it is supposed to work.  Any interface that is faulty within one ring will mark the entire ring faulty.  To re-enable the ring, run corosync-cfgtool -r (once the faulty network condition has been repaired).
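For reference, a minimal recovery sequence might look like the following sketch (run on any cluster node; it assumes a running corosync daemon, and the status output shown by -s will vary per cluster):

```shell
# Inspect the status of this node's rings; a failed ring is reported FAULTY.
corosync-cfgtool -s

# After the underlying network fault has been repaired, clear the FAULTY
# state and re-enable the redundant rings:
corosync-cfgtool -r
```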



Even if every other interface on that ring is working fine?  Why isn't the node with the faulty interface segregated, so the rest can continue to converse on that otherwise healthy ring?  That would dramatically increase the resiliency of the rings, and it is much easier to scale with nodes than it is with interfaces, especially with density being a big trend in datacenters.  I can fit more twin-nodes in my racks than I can interfaces on half a chassis.


Corosync uses the Totem Single-Ring Ordering and Membership Protocol [1] as the basis for managing node membership.  It performs the same token-passing logic found in token ring networks, but uses Ethernet as the network infrastructure.  In a token ring network, what happened when one of the links in the ring was broken?  The entire ring was down.  Therefore, to ensure availability, token ring networks were designed with two redundant rings.  The same is true of this architecture.
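To make the consequence of that design concrete, here is a small illustrative sketch (not Corosync source; the Ring and Cluster names are hypothetical) of why a single failed interface marks the whole redundant ring faulty cluster-wide, while the other ring keeps carrying traffic:

```python
# Illustrative model of passive RRP fault handling: Totem treats each ring
# as a single logical token path, so a broken link anywhere on a ring marks
# that ring FAULTY everywhere, not just on the node with the failed NIC.

class Ring:
    def __init__(self, ringnumber):
        self.ringnumber = ringnumber
        self.faulty = False

class Cluster:
    def __init__(self, node_names, ring_count=2):
        # All nodes share the same set of redundant rings (ring 0, ring 1).
        self.nodes = list(node_names)
        self.rings = [Ring(n) for n in range(ring_count)]

    def interface_failed(self, node, ringnumber):
        # One node's interface failing breaks the token path of the whole
        # ring, so the ring is marked faulty cluster-wide.
        self.rings[ringnumber].faulty = True

    def readmit(self, ringnumber):
        # The equivalent of "corosync-cfgtool -r" after the fault is fixed.
        self.rings[ringnumber].faulty = False

    def healthy_rings(self):
        return [r.ringnumber for r in self.rings if not r.faulty]

cluster = Cluster(["node1", "node2"])
cluster.interface_failed("node2", 1)   # one NIC on ring 1 fails
print(cluster.healthy_rings())         # ring 0 still carries traffic: [0]
cluster.readmit(1)                     # repaired and re-admitted
print(cluster.healthy_rings())         # [0, 1]
```

This is exactly the behaviour Ryan observed: the surviving traffic path is the *other* ring, which is why the protocol is deployed with redundant rings in the first place.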

HTH,
Dan

1. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.767&rep=rep1&type=pdf
Regards
-steve

######### begin corosync.conf
compatibility: whitetank

totem {
   version: 2
   secauth: off
   threads: 0
   rrp_mode: passive
   consensus: 1201

   interface {
      ringnumber: 0
      bindnetaddr: 192.168.192.0
      mcastaddr: 227.94.1.1
      mcastport: 5405
   }

   interface {
      ringnumber: 1
      bindnetaddr: 10.1.0.0
      mcastaddr: 227.94.1.2
      mcastport: 5405
   }
}

logging {
   fileline: off
   to_stderr: yes
   to_syslog: yes
   syslog_facility: daemon
   debug: off
   timestamp: on
   logger_subsys {
      subsys: AMF
      debug: off
   }
}

aisexec {
   user:  root
   group: root
}

service {
   name: pacemaker
   ver:  0
}
######### end corosync.conf

Please let me know if you need anything else to help diagnose this problem.  Also, I found a typo in the error message that appears in the logs ("adminisrtative" instead of "administrative"):

corosync[3419]:   [TOTEM ] Marking seqid 66284 ringid 1 interface 10.1.1.168 FAULTY - adminisrtative intervention required.

A "corosync-cfgtool -r" fixes the issue once the link is healthy again, but it's definitely not optimal to have one interface failure bring down the entire ring.  Again, let me know if there's anything else I can do to assist.  Thanks, and keep up the hard work!


-Ryan
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

