[Openais] corosync enters recovery repeatedly on lossy network

Tim Beale Thu, 17 Jun 2010 19:19:09 -0700

Hi,

I'm running corosync on a setup where corosync packets are getting delayed and
lost. I'm seeing corosync enter recovery mode repeatedly, which then causes
other problems for us. (We're running trunk as of revision 2569 (8 Dec 09), so
some of these knock-on problems may already be fixed.)


The repeated entry into recovery mode doesn't look like it's fixed on the
latest trunk though. The problem is that corosync cancels its token retransmit
timeout prematurely in message_handler_mcast().

In this setup corosync receives some mcast packets out of order. So corosync
can receive a mcast message with a lower seq than the last token it sent out,
and it still stops its token retransmit timer. If the token it just sent is
lost, it is never retransmitted. The token timeout then expires and corosync
enters gather/commit/recovery.

I think the message_handler_mcast() code should also check the seq of the mcast
message before stopping the retransmit timer (see attached patch). You can only
guarantee the last token sent was successfully received if another node sends a
mcast message with a higher seq.
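
To make the proposed condition concrete, here is a minimal standalone sketch of
the decision, separate from the real totemsrp.c code. The function and
parameter names are hypothetical; token_seq plays the role of
orf_token_retransmit_seq in the patch and mcast_seq the role of
mcast_header.seq. Like the plain comparison in the patch, it does not account
for sequence number wraparound.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the check in message_handler_mcast().
 *   from_self: the mcast message was originated by this node
 *   token_seq: seq of the last token this node sent
 *   mcast_seq: seq carried by the received mcast message
 * Returns true when the token retransmit timer may be cancelled.
 */
static bool should_cancel_retransmit(bool from_self,
	unsigned int token_seq, unsigned int mcast_seq)
{
	/* Only a newer message from another node shows the token arrived. */
	return !from_self && token_seq < mcast_seq;
}

/* The three cases discussed above, with the last token sent at seq 10. */
static void retransmit_examples(void)
{
	/* Delayed mcast with a lower seq: keep the retransmit timer. */
	assert(!should_cancel_retransmit(false, 10, 8));
	/* Mcast with a higher seq from another node: cancel the timer. */
	assert(should_cancel_retransmit(false, 10, 11));
	/* Our own mcast says nothing about the token: keep the timer. */
	assert(!should_cancel_retransmit(true, 10, 11));
}
```

Without the seq comparison, the first case would also cancel the timer, which
is the behaviour that leads to the lost token never being retransmitted.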

Does anyone see any problems with this patch?

Thanks,
Tim

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 6a11771..fb02c8f 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -397,6 +397,8 @@ struct totemsrp_instance {
 
 	int orf_token_retransmit_size;
 
+	unsigned int orf_token_retransmit_seq;
+
 	unsigned int my_token_seq;
 
 	/*
@@ -2087,6 +2089,7 @@ originated:
 	instance->my_high_seq_received = SEQNO_START_MSG;
 	instance->my_install_seq = SEQNO_START_MSG;
 	instance->last_released = SEQNO_START_MSG;
+	instance->orf_token_retransmit_seq = SEQNO_START_TOKEN;
 
 	reset_token_timeout (instance); // REVIEWED
 	reset_token_retransmit_timeout (instance); // REVIEWED
@@ -2600,6 +2603,7 @@ static int token_send (
 	orf_token_size = sizeof (struct orf_token) +
 		(orf_token->rtr_list_entries * sizeof (struct rtr_item));
 
+	instance->orf_token_retransmit_seq = orf_token->seq;
 	memcpy (instance->orf_token_retransmit, orf_token, orf_token_size);
 	instance->orf_token_retransmit_size = orf_token_size;
 	orf_token->header.nodeid = instance->my_id.addr[0].nodeid;
@@ -3757,7 +3761,8 @@ static int message_handler_mcast (
 	}
 #endif
 
-        if (srp_addr_equal (&mcast_header.system_from, &instance->my_id) == 0) {
+	if (!srp_addr_equal (&mcast_header.system_from, &instance->my_id) &&
+		instance->orf_token_retransmit_seq < mcast_header.seq) {
 		cancel_token_retransmit_timeout (instance);
 	}

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
