Lars Marowsky-Bree wrote:
> On 2008-09-09T11:18:59, David Teigland <[EMAIL PROTECTED]> wrote:
>
>>> For some reason our cluster splits up into two rings.
>>> Scenario is:
>>> node1(n1) n2 n3 n4 n5 n6 are in the ring.
>>>
>>> Suddenly the ring splits into two rings:
>>> n1 n2 n3 got leave msg from n4 n5 n6
>>> n4 n5 n6 got leave msg from n1 n2 n3
>>>
>>> After a few milliseconds the two rings joins again:
>>> n1 n2 n3 got join msg from n4 n5 n6
>>> n4 n5 n6 got join msg from n1 n2 n3
>>>
>>> The two ring is joined to one ring again:
>>> node1(n1) n2 n3 n4 n5 n6 are in the ring.
>>>
>> We at RH have struggled a great deal with this exact "feature" for quite a
>> long time. It's the biggest problem by far that we've had using openais.
>>
>
> Any insights as to why this occurs? Random membership fluctuations are
> ... a problem.
>
> Pacemaker can, AFAIK, deal with the rings healing, but the splits are
> worrying, as they might cause recovery action to occur.
>
>
> Regards,
> Lars
>

Fault detection as well as membership are managed by the Totem protocol. I assume the following happens:

A node P experiences a token timeout. In this case P automatically assumes that the previous token holder Q has failed and puts Q on its list of failed nodes. This means that the next membership can contain either P or Q, but not both.

P initiates the establishment of a new membership by sending a Gather message. When it receives that message, Q also starts to gather nodes for a competing membership. In effect, two gather phases are executed simultaneously. Some of the other nodes decide to stick with P while others side with Q. In the end two parallel memberships are established: the old ring breaks into two independent rings.

Since we still have a multicast environment, all nodes of each ring receive all messages. Both ring leaders detect that nodes exist that do not belong to their own ring. The fact that P cannot be in the same membership as Q is no longer an issue, because each of the two nodes assumes that the other has been repaired or that the communication disruption has been overcome. The two rings join and form a single ring.

Ruppert

-- 
Ruppert Koch, Ph.D.
Reliable Computer Systems Consulting
http://www.rcsc.de
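
P.S. To make the sequence above easier to follow, here is a rough toy model in Python. It is only a sketch of the scenario I described, not the Totem implementation and not openais code. The function names (split_ring, membership_events, heal) and the "sides_with_p" set are made up for illustration; in a real cluster, which nodes side with P depends on timing and cannot be chosen like this.

```python
# Toy model of the assumed split-and-heal scenario. This is NOT Totem or
# openais code; all names here are hypothetical and chosen for illustration.

def split_ring(ring, p, q, sides_with_p):
    """Return (ring_p, ring_q) after P times out waiting for the token from Q.

    ring          -- node names of the original ring, in ring order
    p, q          -- the node that timed out and the token holder it blames
    sides_with_p  -- assumed set of other nodes that join P's gather phase
    """
    if p == q or p not in ring or q not in ring:
        raise ValueError("p and q must be distinct members of the ring")
    failed_list_of_p = {q}  # P puts Q on its list of failed nodes
    # P's new membership: P plus its followers, never anyone on its failed list.
    ring_p = [n for n in ring
              if (n == p or n in sides_with_p) and n not in failed_list_of_p]
    # Q runs a competing gather phase and ends up with everyone else.
    ring_q = [n for n in ring if n not in ring_p]
    return ring_p, ring_q


def membership_events(kind, ring_a, ring_b):
    """Map each node to the 'leave' or 'join' events it sees for the other ring."""
    events = {}
    for mine, other in ((ring_a, ring_b), (ring_b, ring_a)):
        for node in mine:
            events[node] = [(kind, peer) for peer in other]
    return events


def heal(ring_a, ring_b, original_order):
    """Merge both rings into one, keeping the original ring order."""
    members = set(ring_a) | set(ring_b)
    return [n for n in original_order if n in members]


if __name__ == "__main__":
    ring = ["n1", "n2", "n3", "n4", "n5", "n6"]
    # Assumption: n4 times out on n3, and n5 and n6 happen to side with n4.
    ring_p, ring_q = split_ring(ring, p="n4", q="n3", sides_with_p={"n5", "n6"})
    print("rings after split:", ring_p, ring_q)
    print("n1 sees:", membership_events("leave", ring_p, ring_q)["n1"])
    print("n1 sees:", membership_events("join", ring_p, ring_q)["n1"])
    print("ring after heal:", heal(ring_p, ring_q, ring))
```

With these assumed inputs the model reproduces the reported pattern: n1 n2 n3 see leave and then join events for n4 n5 n6, and the other way around, before all six nodes are in one ring again.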