Jason,
Nice dig into the code/totem. Hope you didn't break the bank on Red
Bull :) I have a few comments inline:
On 11/06/2013 07:16 AM, jason wrote:
Hi All,
I recently encountered a problem where two nodes could not be merged
into one ring.
Initially, there were three nodes in a ring, say A, B and C. Then,
after killing C, I found that A and B could never merge again (I
waited at least 4 hours) unless I restarted at least one of them.
By analyzing the blackbox log, I found that both A and B were stuck in
an endless loop doing the following:
1. Form a single node ring.
2. The ring is broken by a JOIN message from peer.
3. Try to form a two-node ring, but hit the consensus timeout.
4. Go to 1.
I checked the network with omping and it was OK. I used the default
corosync.conf.example, and the corosync version is 1.4.6.
To analyze more deeply, I tcpdumped the traffic to see the content of
the messages exchanged between the two nodes, and found the following
strange things:
1. Every 50ms (I think it is the join timeout):
Node A sends join message with proclist:A,B,C. faillist:B.
Node B sends join message with proclist:A,B,C. faillist:A.
2. Every 1250ms (the consensus timeout):
Node A sends join message with proclist:A,B,C. faillist:B,C.
Node B sends join message with proclist:A,B,C. faillist:A,C.
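For reference, both lists travel inside the join message itself: the
proc list entries are packed right after the fixed fields, followed by
the failed list entries. Abridged from exec/totemsrp.c (field order
from memory, so treat this as a sketch):

    struct memb_join {
            struct totem_message_header header;
            struct srp_addr system_from;
            unsigned int proc_list_entries;
            unsigned int failed_list_entries;
            unsigned long long ring_seq;
            unsigned char end_of_memb_join[0];
            /* proc_list_entries struct srp_addr entries, then
               failed_list_entries struct srp_addr entries,
               follow starting at end_of_memb_join */
    };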
Something is missing from your tcpdump analysis. Once the consensus
timer expires, consensus will be met:
Node A will calculate consensus based upon proclist - faillist = {A};
A has received join messages from everyone in its consensus list,
hence consensus is met.
Node B will calculate consensus based upon proclist - faillist = {B};
B has received join messages from everyone in its consensus list,
hence consensus is met.
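To spell that out with a toy example (plain C, node ids as ints
instead of struct srp_addr, and a simplified stand-in for
memb_set_subset; not the real totemsrp code):

    #include <stdio.h>

    /* toy memb_set_subset: is every member of "sub" also in "set"? */
    static int set_subset (const int *sub, int sub_n,
            const int *set, int set_n)
    {
            int i, j, found;
            for (i = 0; i < sub_n; i++) {
                    for (found = 0, j = 0; j < set_n; j++) {
                            if (sub[i] == set[j]) found = 1;
                    }
                    if (!found) return 0;
            }
            return 1;
    }

    int main (void)
    {
            enum { A = 1, B = 2, C = 3 };
            int proc[] = { A, B, C };
            int fail[] = { B, C };  /* node A's faillist after the timeout */
            int joins[] = { A };    /* nodes A has seen join messages from */
            int consensus[3];
            int i, n = 0;

            /* consensus list = proclist - faillist = { A } */
            for (i = 0; i < 3; i++) {
                    if (!set_subset (&proc[i], 1, fail, 2)) {
                            consensus[n++] = proc[i];
                    }
            }

            /* met once a join was seen from every consensus member */
            printf ("consensus met: %d\n",
                    set_subset (consensus, n, joins, 1));
            return 0;
    }

which prints "consensus met: 1", i.e. node A agrees with itself and
forms the singleton.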
What I would expect from step 3 is, after 1250ms:
Node A will send a join message with proclist: A, B, C; faillist: B, C.
Node B will send a join message with proclist: A, B, C; faillist: A, C.
Further join messages will contain these sets. This should lead to:
Node A forming a singleton configuration because consensus is agreed.
Node B forming a singleton configuration because consensus is agreed.
Node A sends a merge detect.
Node A enters gather and sends a join with proclist: A; faillist: empty.
Node B sends a merge detect.
Node B enters gather and sends a join with proclist: B; faillist: empty.
Nodes A and B receive each other's proclists, both reach consensus,
and form a new ring (A, B).
You said C was killed. This leads to the natural question of why it is
still in the proc list after each node forms a singleton.
It should be because A and B each treated the other as failed, so a
two-node ring could never form and the single-node ring was always
broken by the peer's join messages.
I am not sure why A and B originally marked each other as failed in
their join messages. From analyzing the code, the most likely cause is
a network partition. So I made the following assumption about what
happened:
1. Initially, ring(A,B,C).
2. The network between A and B partitions and, "at the same time", C
goes down.
3. Node A sends a join message with proclist:A,B,C. faillist:NULL.
Node B sends a join message with proclist:A,B,C. faillist:NULL.
4. Both A and B hit the consensus timeout because of the partition.
5. The network between A and B remerges.
6. Node A sends a join message with proclist:A,B,C. faillist:B,C. and
creates ring(A). Node B sends a join message with proclist:A,B,C.
faillist:A,C. and creates ring(B).
7. Say the join message with proclist:A,B,C. faillist:A,C sent by node
B is received by node A, now that the network has remerged.
8. Node A shifts to the gather state and sends out a modified join
message with proclist:A,B,C. faillist:B. Such a join message will
prevent A and B from merging.
9. Node A hits the consensus timeout (waiting for node C) and sends
the join message with proclist:A,B,C. faillist:B,C again.
good analysis
The same thing happens on node B, so A and B will loop forever through
steps 7, 8 and 9.
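Written out as a trace under that assumption (sets in braces), node
A's loop is:

    A: sends join proclist {A,B,C} faillist {B,C} -> consensus {A}, forms ring (A)
    A: receives B's join, which has A in its faillist
    A: shifts to gather, sends join proclist {A,B,C} faillist {B}
    A: consensus timeout waiting for C
    A: sends join proclist {A,B,C} faillist {B,C} -> back to the top

with the symmetric trace on node B.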
If my assumption and analysis are right, then I think it is step 8
that does the wrong thing. According to the paper I found at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.4028&rep=rep1&type=pdf
it says: "if a processor receives a join message in the operational
state and if the receiver's identifier is in the join message's fail
list, ... then it ignores the join message."
Figure 4.4 doesn't match the text. I've found that in these cases in
academic papers, the text takes precedence.
So I created a patch applying the above algorithm to try to solve the
problem:
--- ./corosync-1.4.6-orig/exec/totemsrp.c	Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c	Wed Nov  6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
 	srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }
 
+static int ignore_join_under_operational (
+	struct totemsrp_instance *instance,
+	const struct memb_join *memb_join)
+{
+	struct srp_addr *proc_list;
+	struct srp_addr *failed_list;
+	unsigned long long ring_seq;
+
+	proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+	failed_list = proc_list + memb_join->proc_list_entries;
+	ring_seq = memb_join->ring_seq;
+
+	if (memb_set_subset (&instance->my_id, 1,
+		failed_list, memb_join->failed_list_entries)) {
+		return 1;
+	}
+
+	/* In operational state, my_proc_list is exactly the same as
+	   my_memb_list. */
+
what is the point of the below code?
+	if ((memb_set_subset (&memb_join->system_from, 1,
+		instance->my_memb_list,
+		instance->my_memb_entries)) &&
+		(ring_seq < instance->my_ring_id.seq)) {
+		return 1;
+	}
+
+	return 0;
+}
+
 static int message_handler_memb_join (
 	struct totemsrp_instance *instance,
 	const void *msg,
@@ -4304,7 +4334,9 @@
 	}
 	switch (instance->memb_state) {
 	case MEMB_STATE_OPERATIONAL:
-		memb_join_process (instance, memb_join);
+		if (ignore_join_under_operational (instance, memb_join) == 0) {
+			memb_join_process (instance, memb_join);
+		}
 		break;
 
 	case MEMB_STATE_GATHER:
So far I haven't reproduced the problem in a 3-node cluster, but I
have reproduced the "a processor receives a join message in the
operational state and the receiver's identifier is in the join
message's fail list" circumstance in a two-node environment, using the
following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP
(This drops inbound UDP on eth0, except DNS replies, for about 2.1
seconds, long enough to outlast the token and consensus timeouts and
so simulate a short partition.)
In the two-node environment there is no dead-loop issue like in the
3-node one, because there is no consensus timeout caused by the third,
dead node as in step 9. But it can still be used to prove the patch.
Please take a look at this issue. Thanks!
Please use git send-email to send the email. It allows easier merging
of the patch and attribution of the work.
Regards
-steve
--
Yours,
Jason
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss