Jason,
Nice dig into the code/totem. Hope you didn't break the bank on Red
Bull :) I have a few comments inline:
On 11/06/2013 07:16 AM, jason wrote:
Hi All,
I recently encountered a problem where two nodes could not be merged
into one ring.
Initially, there were three nodes in a ring, say A, B and C. Then,
after killing C, I found that A and B could never merge again (I
waited at least 4 hours) unless I restarted at least one of them.
By analyzing the blackbox log, I found that both A and B were stuck in
an endless loop doing the following:
1. Form a single node ring.
2. The ring is broken by a JOIN message from peer.
3. Try to form a two-node ring, but hit the consensus timeout.
4. Go to 1.
I checked the network with omping and it was OK. I used the default
corosync.conf.example, and the corosync version is 1.4.6.
To analyze more deeply, I tcpdumped the traffic to see the content of
the messages exchanged between the two nodes, and found the following
strange things:
1. Every 50ms (I think it is the join timeout):
Node A sends join message with proclist:A,B,C. faillist:B.
Node B sends join message with proclist:A,B,C. faillist:A.
2. Every 1250ms (the consensus timeout):
Node A sends join message with proclist:A,B,C. faillist:B,C.
Node B sends join message with proclist:A,B,C. faillist:A,C.
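For reference, both lists travel inside the join message itself: the
proc list entries are packed right after the fixed fields, followed by
the failed list entries. Abridged from exec/totemsrp.c (field order
from memory, so treat this as a sketch):

    struct memb_join {
            struct totem_message_header header;
            struct srp_addr system_from;
            unsigned int proc_list_entries;
            unsigned int failed_list_entries;
            unsigned long long ring_seq;
            unsigned char end_of_memb_join[0];
            /* proc_list_entries struct srp_addr entries, then
               failed_list_entries struct srp_addr entries,
               follow starting at end_of_memb_join */
    };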
Something is missing from your tcpdump analysis. Once the consensus
timer expires, consensus will be met:
Node A will calculate consensus based upon proclist - faillist = {A};
A has received join messages from everyone in its consensus list,
hence consensus is met.
Node B will calculate consensus based upon proclist - faillist = {B};
B has received join messages from everyone in its consensus list,
hence consensus is met.
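To spell that out with a toy example (plain C, node ids as ints
instead of struct srp_addr, and a simplified stand-in for
memb_set_subset; not the real totemsrp code):

    #include <stdio.h>

    /* toy memb_set_subset: is every member of "sub" also in "set"? */
    static int set_subset (const int *sub, int sub_n,
            const int *set, int set_n)
    {
            int i, j, found;
            for (i = 0; i < sub_n; i++) {
                    for (found = 0, j = 0; j < set_n; j++) {
                            if (sub[i] == set[j]) found = 1;
                    }
                    if (!found) return 0;
            }
            return 1;
    }

    int main (void)
    {
            enum { A = 1, B = 2, C = 3 };
            int proc[] = { A, B, C };
            int fail[] = { B, C };  /* node A's faillist after the timeout */
            int joins[] = { A };    /* nodes A has seen join messages from */
            int consensus[3];
            int i, n = 0;

            /* consensus list = proclist - faillist = { A } */
            for (i = 0; i < 3; i++) {
                    if (!set_subset (&proc[i], 1, fail, 2)) {
                            consensus[n++] = proc[i];
                    }
            }

            /* met once a join was seen from every consensus member */
            printf ("consensus met: %d\n",
                    set_subset (consensus, n, joins, 1));
            return 0;
    }

which prints "consensus met: 1", i.e. node A agrees with itself and
forms the singleton.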
What I would expect from step 3 is, after 1250ms:
Node A will send a join message with proclist: A, B, C; faillist: B, C.
Node B will send a join message with proclist: A, B, C; faillist: A, C.
Further join messages will contain these sets. This should lead to:
Node A forming a singleton configuration because consensus is agreed.
Node B forming a singleton configuration because consensus is agreed.
Node A sends a merge detect.
Node A enters gather and sends a join with proclist: A; faillist: empty.
Node B sends a merge detect.
Node B enters gather and sends a join with proclist: B; faillist: empty.
Nodes A and B receive each other's proclists, both reach consensus,
and form a new ring (A, B).
You said C was killed. This leads to the natural question of why it is
still in the proc list after each node forms a singleton.
It should be because A and B each treated the other as failed, so a
two-node ring could never form and the single-node ring was always
broken by the peer's join messages.
I am not sure why A and B originally marked each other as failed in
their join messages. From analyzing the code, the most likely cause is
a network partition. So I made the following assumption about what
happened:
1. Initially, ring(A,B,C).
2. The network between A and B partitions and, "at the same time", C
goes down.
3. Node A sends a join message with proclist:A,B,C. faillist:NULL.
Node B sends a join message with proclist:A,B,C. faillist:NULL.
4. Both A and B hit the consensus timeout because of the partition.
5. The network between A and B remerges.
6. Node A sends a join message with proclist:A,B,C. faillist:B,C. and
creates ring(A). Node B sends a join message with proclist:A,B,C.
faillist:A,C. and creates ring(B).
7. Say the join message with proclist:A,B,C. faillist:A,C sent by node
B is received by node A, now that the network has remerged.
8. Node A shifts to the gather state and sends out a modified join
message with proclist:A,B,C. faillist:B. Such a join message will
prevent A and B from merging.
9. Node A hits the consensus timeout (waiting for node C) and sends
the join message with proclist:A,B,C. faillist:B,C again.
good analysis
The same thing happens on node B, so A and B will loop forever through
steps 7, 8 and 9.
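Written out as a trace under that assumption (sets in braces), node
A's loop is:

    A: sends join proclist {A,B,C} faillist {B,C} -> consensus {A}, forms ring (A)
    A: receives B's join, which has A in its faillist
    A: shifts to gather, sends join proclist {A,B,C} faillist {B}
    A: consensus timeout waiting for C
    A: sends join proclist {A,B,C} faillist {B,C} -> back to the top

with the symmetric trace on node B.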
If my assumption and analysis are right, then I think it is step 8
that does the wrong thing. According to the paper I found at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.4028&rep=rep1&type=pdf
it says: "if a processor receives a join message in the operational
state and if the receiver's identifier is in the join message's fail
list, ... then it ignores the join message."
Figure 4.4 doesn't match the text. I've found that in these cases in
academic papers, the text takes precedence.
So I created a patch applying the above algorithm to try to solve the
problem:
--- ./corosync-1.4.6-orig/exec/totemsrp.c	Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c	Wed Nov  6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
 	srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }
 
+static int ignore_join_under_operational (
+	struct totemsrp_instance *instance,
+	const struct memb_join *memb_join)
+{
+	struct srp_addr *proc_list;
+	struct srp_addr *failed_list;
+	unsigned long long ring_seq;
+
+	proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+	failed_list = proc_list + memb_join->proc_list_entries;
+	ring_seq = memb_join->ring_seq;
+
+	if (memb_set_subset (&instance->my_id, 1,
+		failed_list, memb_join->failed_list_entries)) {
+		return 1;
+	}
+
+	/* In operational state, my_proc_list is exactly the same as
+	   my_memb_list. */
+
what is the point of the below code?
+	if ((memb_set_subset (&memb_join->system_from, 1,
+		instance->my_memb_list,
+		instance->my_memb_entries)) &&
+		(ring_seq < instance->my_ring_id.seq)) {
+		return 1;
+	}
+
+	return 0;
+}
+
 static int message_handler_memb_join (
 	struct totemsrp_instance *instance,
 	const void *msg,
@@ -4304,7 +4334,9 @@
 	}
 	switch (instance->memb_state) {
 	case MEMB_STATE_OPERATIONAL:
-		memb_join_process (instance, memb_join);
+		if (ignore_join_under_operational (instance, memb_join) == 0) {
+			memb_join_process (instance, memb_join);
+		}
 		break;
 
 	case MEMB_STATE_GATHER:
So far I haven't reproduced the problem in a 3-node cluster, but I
have reproduced the "a processor receives a join message in the
operational state and the receiver's identifier is in the join
message's fail list" circumstance in a two-node environment, using the
following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP
(This drops inbound UDP on eth0, except DNS replies, for about 2.1
seconds, long enough to outlast the token and consensus timeouts and
so simulate a short partition.)
In the two-node environment there is no dead-loop issue like in the
3-node one, because there is no consensus timeout caused by the third,
dead node as in step 9. But it can still be used to prove the patch.
Please take a look at this issue. Thanks!
Please use git send-email to send the email. It allows easier merging
of the patch and attribution of the work.
Regards
-steve
--
Yours,
Jason
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss