Hi,

> It is hard to tell what is happening without logs from all 3 nodes. Does this only happen at system start, or can you duplicate 5 minutes after systems have started?

>>> The cluster never stabilizes. It keeps switching between the membership and operational states. Below is the test network I am using:

[image: Untitled.png]

>>> N1 and N3 do not receive any packets from each other. What I expected was that either (N1, N2) or (N2, N3) would form a two-node cluster and stabilize. But the cluster never stabilizes: even though two-node clusters form, it keeps going back to membership. [I checked the logs, and this appears to be caused by the steps I described in the previous mail.]

Regards,
Ranjith

On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake <[email protected]> wrote:

> It is hard to tell what is happening without logs from all 3 nodes. Does
> this only happen at system start, or can you duplicate 5 minutes after
> systems have started?
>
> If it is at system start, you may need to enable "fast STP" on your switch.
> It looks to me like node 3 gets some messages through but then is blocked.
> STP will do this in its default state on most switches.
>
> Another option if you can't enable STP is to use broadcast mode (man
> openais.conf for details).
>
> Also verify firewalls are properly configured on all nodes. You can join
> us on the irc server freenode on #linux-cluster for real-time assistance.
>
> Regards
> -steve
>
>
> On 09/22/2010 11:33 PM, Ranjith wrote:
>
>> Hi Steve,
>> I am running corosync 1.2.8
>> I didn't get what you meant by blackbox. I suppose it is logs/debugs.
>> I just checked logs/debugs and I am able to understand the below:
>> 1--------------2--------------3
>> 1) Node1 and Node2 are already in a 2-node cluster
>> 2) Now Node3 sends join with ({1} , {} ) (proc_list/fail_list)
>> 3) Node2 sends join ({1,2,3} , {}) and Node 1/3 updates to ({1,2,3}, {})
>> 4) Now Node 2 gets consensus after some messages [But 1 is the rep]
>> 5) Consensus timeout fires at node 1 for node 3, node1 sends join as
>> ({1,2}, {3})
>> 6) Node2 updates because of the above message to ({1,2}, {3}) and sends
>> out join. This join received by node 3 causes it to update ({1,3}, {2})
>> 7) Node1 and Node2 enter operational (fail list cleared by node2) but
>> node 3 join timeout fires and it enters membership state again.
>> 8) This will continue to happen until consensus fires at node3 for node1
>> and it moves to ({3}, {1,2})
>> 9) Now Node1 and Node2 form a 2-node cluster and 3 forms a single-node
>> cluster
>> 10) Now node 2 broadcasts a Normal message
>> 11) This message is received by Node3 as a foreign message which forces
>> it to go to gather state
>> 12) Again above steps ....
>> The cluster is never stabilizing.
>> I have attached the debugs for Node2:
>> (1 - 10.102.33.115, 2 - 10.102.33.150, 3 - 10.102.33.180)
>> Regards,
>> Ranjith
>>
>> On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake <[email protected]> wrote:
>>
>> On 09/21/2010 11:15 PM, Ranjith wrote:
>>
>> Hi all,
>> Kindly comment on the above behaviour
>> Regards,
>> Ranjith
>>
>> On Tue, Sep 21, 2010 at 9:52 PM, Ranjith <[email protected]> wrote:
>>
>> Hi all,
>> I was testing the corosync cluster engine by using the testcpg exec
>> provided along with the release. I am getting the below behaviour
>> while testing some specific scenarios. Kindly comment on the
>> expected behaviour.
>> 1) 3 Node cluster
>> 1---------2---------3
>> a) suppose I bring the nodes 1&2 up, it will form a ring (1,2)
>> b) now bring up 3
>> c) 3 sends join which restarts the membership process
>> d) (1,2) again forms the ring, 3 forms self cluster
>> e) now 3 sends a join (due to join or other timeout)
>> f) again membership protocol is started as 2 responds to this
>> by going to gather state (I believe 2 should not accept this as 2
>> would have earlier decided that 3 is failed)
>> I am seeing a continuous loop of the above behaviour
>> (operational -> membership -> operational -> ) due to which the
>> cluster is not becoming stabilized
>> 2) 3 Node Cluster
>> 1---------2-----------3
>> a) bring up all the three nodes at the same time (None of the
>> nodes have seen each other before this)
>> b) Now each node forms a cluster by itself .. (Here I think it
>> should form either a (1,2) or (2,3) ring)
>> Regards,
>> Ranjith
>>
>>
>> Ranjith,
>>
>> Which version of corosync are you running?
>>
>> can you run corosync-blackbox and attach the output?
>>
>> Thanks
>> -steve
>>
>>
>> _______________________________________________
>> Openais mailing list
>> [email protected]
>> https://lists.linux-foundation.org/mailman/listinfo/openais
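
For anyone reading this thread later, the test topology (1 and 2 can talk, 2 and 3 can talk, 1 and 3 cannot) can be written down as a small Python sketch. This is only an illustration of which node groups are fully connected; it is not corosync code and it does not model how the totem membership protocol actually reaches consensus. The names NODES, LINKS, fully_connected and maximal_groups are made up for this example.

```python
from itertools import combinations

# Assumed test topology from the thread: 1 <-> 2 and 2 <-> 3 can exchange
# packets, 1 and 3 cannot. Node numbers are the ones used in the mails.
NODES = [1, 2, 3]
LINKS = {frozenset((1, 2)), frozenset((2, 3))}


def fully_connected(group, links):
    """Return True if every pair of nodes in group shares a link."""
    return all(frozenset(pair) in links for pair in combinations(group, 2))


def maximal_groups(nodes, links):
    """List the largest groups in which all members can hear each other."""
    groups = []
    # Try larger groups first so that their subsets can be skipped later.
    for size in range(len(nodes), 0, -1):
        for group in combinations(nodes, size):
            candidate = set(group)
            if not fully_connected(group, links):
                continue
            # Skip groups already contained in a larger accepted group.
            if any(candidate < found for found in groups):
                continue
            groups.append(candidate)
    return groups


if __name__ == "__main__":
    print(maximal_groups(NODES, LINKS))
```

With the links above, the only fully connected groups of more than one node are (1,2) and (2,3), which matches the expectation stated in the mails that one of those two rings should form and stay stable. Node 2 belongs to both groups, so this sketch says nothing about which ring the protocol should settle on.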
<<Untitled.png>>
