Hi,
Ryan Steele wrote:
Steven Dake wrote:
On 12/08/2010 08:36 AM, Ryan Steele wrote:
Hey,
Just noticed a problem with Corosync 1.2.0-0ubuntu1 on a two-node cluster
configured with redundant rings, which I found while testing my STONITH
devices. When an interface on either one of the rings fails and is marked
faulty, the interface for that same ring on the other node is also
immediately marked faulty. This means that if any interface fails, the
entire associated ring fails. I mentioned it in IRC, and it was believed
to be a bug. Here is my corosync.conf:
That is how it is supposed to work. Any interface that is faulty within
one ring will mark the entire ring faulty. To re-enable the ring, run
corosync-cfgtool -r (once the faulty network condition has been repaired).
Even if every other interface on that ring is working fine? Why isn't the
node with the faulty interface segregated, so the rest can continue to
converse on that otherwise healthy ring? That would dramatically increase
the resiliency of the rings, and it is much easier to scale with nodes
than it is with interfaces, especially with density being a big trend in
datacenters. I can fit more twin-nodes in my racks than I can interfaces
on half a chassis.
Corosync uses the Totem Single-Ring Ordering and Membership Protocol [1]
as the basis for how it manages node membership: it performs the same
token-passing logic found in token ring networks, but uses Ethernet as
the network infrastructure. In a token ring network, what happened when
one of the links in the ring was broken? The entire ring went down.
Therefore, to ensure availability, token ring networks were designed with
two redundant rings. The same is true for this architecture.
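To illustrate the behavior described above, here is a minimal sketch of the
ring-wide fault semantics (the class and method names are hypothetical, not
Corosync's actual implementation): one failed interface breaks token passing
for the whole ring, so the ring is marked faulty everywhere at once.

```python
# Minimal sketch of ring-wide fault marking; names are hypothetical,
# this is not Corosync code.

class RedundantRings:
    def __init__(self, nodes, ring_ids):
        self.nodes = nodes
        # Each ring is healthy until any member interface on it fails.
        self.ring_faulty = {rid: False for rid in ring_ids}

    def interface_failed(self, node, ring_id):
        # One broken link breaks token passing for the whole ring,
        # so the ring is marked faulty on every node at once.
        self.ring_faulty[ring_id] = True

    def usable_rings(self):
        return [rid for rid, faulty in self.ring_faulty.items() if not faulty]

cluster = RedundantRings(nodes=["node1", "node2"], ring_ids=[0, 1])
cluster.interface_failed("node2", ring_id=0)   # one NIC on ring 0 fails
print(cluster.usable_rings())                  # only ring 1 remains usable
```

This is why the question above amounts to asking for per-node interface
segregation rather than the per-ring fault model Totem actually uses.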
HTH,
Dan
1.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.767&rep=rep1&type=pdf
Regards
-steve
######### begin corosync.conf
compatibility: whitetank

totem {
    version: 2
    secauth: off
    threads: 0
    rrp_mode: passive
    consensus: 1201
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.192.0
        mcastaddr: 227.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.1.0.0
        mcastaddr: 227.94.1.2
        mcastport: 5405
    }
}

logging {
    fileline: off
    to_stderr: yes
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

aisexec {
    user: root
    group: root
}

service {
    name: pacemaker
    ver: 0
}
######### end corosync.conf
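One note on the bindnetaddr values in the config above: bindnetaddr is the
network address of the ring's subnet, derived from the interface IP and
netmask. A quick way to double-check it (the interface IPs and prefix
lengths below are my assumptions for illustration; only the resulting
network addresses come from the config):

```python
import ipaddress

# bindnetaddr must be the network address of the ring's subnet.
# Interface IPs and prefix lengths here are assumed for illustration.
ring0 = ipaddress.ip_interface("192.168.192.10/24").network.network_address
ring1 = ipaddress.ip_interface("10.1.1.168/16").network.network_address
print(ring0)  # 192.168.192.0
print(ring1)  # 10.1.0.0
```

If a computed address does not match the bindnetaddr in corosync.conf,
Corosync will not bind to the intended interface for that ring.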
Please let me know if you need anything else to help diagnose this problem.
Also, I found a typo in the error message
that appears in the logs ("adminisrtative" instead of "administrative"):
corosync[3419]: [TOTEM ] Marking seqid 66284 ringid 1 interface 10.1.1.168
FAULTY - adminisrtative intervention required.
A "corosync-cfgtool -r" fixes the issue once the link is healthy again, but
it's definitely not optimal to have one
interface failure bring down the entire ring. Again, let me know if there's
anything else I can do to assist. Thanks,
and keep up the hard work!
-Ryan
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania