On 10/21/2010 09:33 PM, Ranjith wrote:
> Hi Steve,
>
> Any views on the above?
>
> Regards,
> Ranjith
>
> On Wed, Oct 6, 2010 at 10:41 AM, Ranjith <[email protected]> wrote:
>
> Hi Steve,
>
> Please check the below (please correct me if I am wrong at any point):
>
> A (block all packets from src C)
> B
> C (block all packets from src A)
>
> Nodes A,B,C
> 1) A sends join (multicast)
> 2) Only B receives. (C drops it because of ACL)
> 3) B sends join (multicast) (with A,B)
> 4) A,C receive join
> 5) C sends join (with A,B,C) / A also sends join (with A,B,C)
> 6) Only B receives the above
> 7) B sends join (with A,B,C)
> 8) A, C send join (with A,B,C)
> 9) B gets consensus from all, A waits for consensus from C,
>    C waits for consensus from A
> 10) A has the smallest id, so it has to generate the commit token,
>     but it is waiting for consensus from C
> 11) Consensus timeout fires for C at A (taking one particular
>     sequence of events)
> 12) A sends join with (A, B, C / failset C)
>
> There are two cases from this point:
>
> Note: There is a deviation in the Corosync 1.2.8 totem code in
> comparison to standard totem (Totem Paper). (Please refer to
> memb_join_process().) In the corosync 1.2.8 totem code, the fail set
> is updated only if the join message comes from a node which is
> already known to this node from a previous ring (my_memb_list).
>
> Case 1: B knows A earlier and A is there in B's my_memb_list
> (due to the previous ring)
>
> 13) B receives the join
> 14) B sends join with (A, B, C / failset C)
> 15) A, C receive the join
> 16) A gets consensus and creates the commit token. A ring gets
>     formed between A and B
> 17) When C receives B's join, C adds B to its failset. C sends
>     join with (A, B, C / failset B)
> 18) B ignores the above join (protocol)
> 19) Consensus timeout for A fires at C. C forms a 1-node ring
> 20) Now both ring (A, B) and ring (C) go to operational state
> 21) Now in ring (A, B), a Normal message is multicast.
> 22) When C receives this message, it treats it as a foreign
>     message and again sends join with (A, B). Again all nodes go to
>     membership
> 23) This keeps on happening
>
> Case 2: B does not know A earlier and A is not there in B's
> my_memb_list
>
> In this case, since none of the nodes know each other previously,
> all the nodes form 1-node clusters and go to operational state
> (my expectation was that a 2-node cluster should be formed)
>
Your analysis of the protocol is correct. We added the additional logic based upon Deb Agarwal's PhD work:

http://www.google.com/url?sa=t&source=web&cd=4&sqi=2&ved=0CCQQFjAD&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.52.4028%26rep%3Drep1%26type%3Dpdf&rct=j&q=a%20dissertation%20univesity%20of%20california%20deb%20agawal&ei=FQjCTPmDMd3NjAfX-b1O&usg=AFQjCNFV0qXbKD_EcaTHhhIzU8bSA7rgmw&cad=rja

> Regards,
> Ranjith
>
> On Tue, Oct 5, 2010 at 11:46 PM, Steven Dake <[email protected]> wrote:
>
> On 10/05/2010 02:29 AM, Ranjith wrote:
>
> Hi Steve,
>
> Please comment on the below.
>
> Regards,
> Ranjith
>
> On Fri, Oct 1, 2010 at 12:04 AM, Ranjith <[email protected]> wrote:
>
> Hi steve,
>
> Network is like this:
> A (block all packets from src C)
> B
> C (block all packets from src A)
>
> Nodes A,B,C
> A sends join (multicast)
> Only B receives. (C drops it because of ACL)
> B sends join (multicast) (with A,B)
> A,C receive join
> C sends join (with A,B,C)
> Only B receives the above
> B sends join (with A,B,C)
> A, C sends join (with A,B,C)
> B gets consensus but suppose A is the smallest Id
> But A never gets consensus as A cannot get join from C
>
> This is not exactly how the algorithm works. I recommend reading the
> totem specification if you want the details. After you have read the
> specification, we can go through an example of the proc and fail
> lists in this scenario.
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.767&rep=rep1&type=pdf
>
> the algorithm for handling a join message is described on page 16
> Figure 3 "Join message from processor q received" and page 17
> Figure 4 "Join message from processor q received".
>
> Regards
> -steve
>
> Am I correct till this point?
>
> Regards,
> Ranjith
>
> On Thu, Sep 30, 2010 at 11:49 PM, Steven Dake <[email protected]> wrote:
>
> On 09/30/2010 10:40 AM, Ranjith wrote:
>
> Hi Steve,
>
> I believe you mean to say that the same ACL rules should be applied
> on the outgoing side also. But since here the nodes are not receiving
> any packet (both multicast and unicast) from the other, I believe it
> will also not send to the other. Is that right?
>
> That assumption is incorrect. Example:
>
> Nodes A,B,C
> A sends join (multicast)
> B,C receive join
> B sends join (multicast)
> A,C receive join
> C sends join (with A,B,C)
> now A rejects that message.
>
> As a result, the nodes can never come to consensus.
>
> Regards
> -steve
>
> Regards,
> Ranjith
>
> On Thu, Sep 30, 2010 at 10:41 PM, Steven Dake <[email protected]> wrote:
>
> On 09/30/2010 03:47 AM, Ranjith wrote:
>
> Hi all,
>
> Kindly let me know whether corosync considers the below network as a
> byzantine failure, i.e. the case where N1 and N3 do not have
> connectivity? I am testing such scenarios as I believe such a
> behaviour can happen due to some misbehaviour in a switch (stale arp
> entries).
>
> What makes the fault byzantine is that only incoming packets are
> blocked. If you block both incoming and outgoing packets on the
> nodes, the fault is not byzantine and totem will behave properly.
>
> Regards
> -steve
>
> Regards,
> Ranjith
>
> Untitled.png
>
> On Sat, Sep 25, 2010 at 9:47 AM, Ranjith <[email protected]> wrote:
>
> Hi Steve,
>
> Just to make it clear. Do you mean that in the above case, if N3 is
> part of the network, it should have connectivity to both N2 and N1,
> and if it so happens that N3 has connectivity to N2 only, corosync
> does not take care of the same?
>
> Regards,
> Ranjith
>
> On Sat, Sep 25, 2010 at 9:39 AM, Steven Dake <[email protected]> wrote:
>
> On 09/24/2010 08:20 PM, Ranjith wrote:
>
> Hi,
>
> It is hard to tell what is happening without logs from all 3 nodes.
> Does this only happen at system start, or can you duplicate 5 minutes
> after systems have started?
>
> >> The cluster is never stabilizing. It keeps on switching between
> >> the membership and operational state.
>
> Below is the test network which I am using:
>
> Untitled.png
>
> >> N1 and N3 do not receive any packets from each other. Here what I
> >> expected was that either (N1,N2) or (N2,N3) forms a two node
> >> cluster and stabilizes. But the cluster is never stabilizing: even
> >> though 2 node clusters are forming, it is going back to membership
> >> [I checked the logs and it looks like this is happening because of
> >> the steps I mentioned in the previous mail]
>
> ...... Where did you say you were testing a byzantine fault in your
> original bug report? Please be more forthcoming in the future.
> Corosync does not protect against byzantine faults. Allowing one-way
> connectivity in a network connection is this fault scenario. You can
> try coro-netctl (the attached script) which will atomically block a
> network ip in the network to test split brain scenarios without
> actually pulling network cables.
>
> Regards
> -steve
>
> Regards,
> Ranjith
>
> On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake <[email protected]> wrote:
>
> It is hard to tell what is happening without logs from all 3 nodes.
> Does this only happen at system start, or can you duplicate 5 minutes
> after systems have started?
>
> If it is at system start, you may need to enable "fast STP" on your
> switch. It looks to me like node 3 gets some messages through but
> then is blocked. STP will do this in its default state on most
> switches.
>
> Another option if you can't enable STP is to use broadcast mode (man
> openais.conf for details).
>
> Also verify firewalls are properly configured on all nodes. You can
> join us on the irc server freenode on #linux-cluster for real-time
> assistance.
>
> Regards
> -steve
>
> On 09/22/2010 11:33 PM, Ranjith wrote:
>
> Hi Steve,
>
> I am running corosync 1.2.8.
> I didn't get what you meant by blackbox. I suppose it is logs/debugs.
> I just checked logs/debugs and I am able to understand the below:
>
> 1--------------2--------------3
>
> 1) Node1 and Node2 are already in a 2-node cluster
> 2) Now Node3 sends join with ({1}, {}) (proc_list/fail_list)
> 3) Node2 sends join ({1,2,3}, {}) and Node 1/3 updates to
>    ({1,2,3}, {})
> 4) Now Node 2 gets consensus after some messages [But 1 is the rep]
> 5) Consensus timeout fires at node 1 for node 3, node1 sends join as
>    ({1,2}, {3})
> 6) Node2 updates because of the above message to ({1,2}, {3}) and
>    sends out join. This join received by node 3 causes it to update
>    ({1,3}, {2})
> 7) Node1 and Node2 enter operational (fail list cleared by node2) but
>    node 3 join timeout fires and again membership state.
> 8) This will continue to happen until consensus fires at node3 for
>    node1 and it moves to ({3}, {1,2})
> 9) Now Node1 and Node2 form a 2-node cluster and 3 forms a single
>    node cluster
> 10) Now node 2 broadcasts a Normal message
> 11) This message is received by Node3 as a foreign message which
>     forces it to go to gather state
> 12) Again above steps ....
>
> The cluster is never stabilizing.
> I have attached the debugs for Node2:
> (1 - 10.102.33.115, 2 - 10.102.33.150, 3 - 10.102.33.180)
>
> Regards,
> Ranjith
>
> On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake <[email protected]> wrote:
>
> On 09/21/2010 11:15 PM, Ranjith wrote:
>
> Hi all,
>
> Kindly comment on the above behaviour
>
> Regards,
> Ranjith
>
> On Tue, Sep 21, 2010 at 9:52 PM, Ranjith <[email protected]> wrote:
>
> Hi all,
>
> I was testing the corosync cluster engine by using the testcpg exec
> provided along with the release. I am getting the below behaviour
> while testing some specific scenarios. Kindly comment on the expected
> behaviour.
>
> 1) 3 Node cluster
>
> 1---------2---------3
>
> a) suppose I bring the nodes 1&2 up, it will form a ring (1,2)
> b) now bring up 3
> c) 3 sends join which restarts the membership process
> d) (1,2) again forms the ring, 3 forms self cluster
> e) now 3 sends a join (due to join or other timeout)
> f) again membership protocol is started as 2 responds to this by
>    going to gather state (I believe 2 should not accept this as 2
>    would have earlier decided that 3 is failed)
>
> I am seeing a continuous loop of the above behaviour
> (operational -> membership -> operational -> ) due to which the
> cluster is not becoming stabilized
>
> 2) 3 Node Cluster
>
> 1---------2-----------3
>
> a) bring up all the three nodes at the same time (none of the nodes
>    have seen each other before this)
> b) Now each node forms a cluster by itself .. (Here I think it should
>    form either a (1,2) or (2,3) ring)
>
> Regards,
> Ranjith
>
> Ranjith,
>
> Which version of corosync are you running?
>
> can you run corosync-blackbox and attach the output?
>
> Thanks
> -steve
>
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
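
The fail-set rule discussed in the Oct 6 message (Case 1 versus Case 2) can be illustrated with a toy model. This is a minimal sketch under stated assumptions: it is not taken from the Corosync source or from memb_join_process(), and the `Node` class, `on_join` method, and `gate_on_previous_ring` flag are hypothetical names invented for illustration. It models only the one rule described in the thread, that a received fail set is merged only when the sender is in the receiver's my_memb_list, and ignores timers, consensus, and commit tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy membership state for one processor (all names are illustrative)."""
    name: str
    my_memb_list: set = field(default_factory=set)  # members of the previous ring
    proc_set: set = field(default_factory=set)      # processors being considered
    fail_set: set = field(default_factory=set)      # processors considered failed

    def on_join(self, sender, proc_set, fail_set, gate_on_previous_ring=True):
        """Merge a received join message; return True if local state changed."""
        before = (set(self.proc_set), set(self.fail_set))
        self.proc_set |= set(proc_set)
        # Assumed gating rule, as described in the thread: accept the sender's
        # fail set only if the sender was a member of our previous ring.
        if not gate_on_previous_ring or sender in self.my_memb_list:
            self.fail_set |= set(fail_set)
        return before != (self.proc_set, self.fail_set)


def b_after_step_12(b_knows_a):
    """Return B's fail set after receiving A's join (A, B, C / failset C)."""
    previous_ring = {"A", "B"} if b_knows_a else {"B"}
    b = Node("B", my_memb_list=previous_ring, proc_set={"A", "B", "C"})
    b.on_join("A", {"A", "B", "C"}, {"C"})
    return b.fail_set


case_1 = b_after_step_12(b_knows_a=True)   # B adopts failset C, as in step 14
case_2 = b_after_step_12(b_knows_a=False)  # B leaves its fail set untouched
print("Case 1:", case_1, "Case 2:", case_2)
```

In this model, Case 1 lets B agree with A on (A, B, C / failset C), which matches the two-node ring in steps 13 to 16. In Case 2 B never adopts the fail set, which is consistent with the observation that every node ends up in a one-node ring.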
