On 06/17/2010 07:16 PM, Tim Beale wrote:
> Hi,
>
> I'm running corosync on a setup where corosync packets are getting delayed and
> lost. I'm seeing corosync enter recovery mode repeatedly, which is then causing
> other problems for us. (We're running trunk as at revision 2569 (8 Dec 09), so
> some of these flow-on problems may already be fixed.)
>
> Corosync entering recovery mode repeatedly doesn't look like it's fixed on the
> latest trunk though. The problem is corosync is canceling its token retransmit
> timeout prematurely in message_handler_mcast().
>
> Corosync in this setup is getting some mcast packets received out of order. So
> corosync receives a mcast message with a lower seq than the last token it sent
> out and stops its token retransmit timer. If the token it just sent is lost,
> then it doesn't retransmit the token. The token timeout occurs and corosync
> enters gather/commit/recovery.
>
> I think the message_handler_mcast() code should also check the seq of the mcast
> message before stopping the retransmit timer (see attached patch). You can only
> guarantee the last token sent was successfully received if another node sends a
> mcast message with a higher seq.
>
> Does anyone see any problems with this patch?
>

Missed your email - sorry for long delay.

Thanks for pointing out the problem - you found a problem in the totem 
spec!  Your logic is sound and shows a good understanding of how totem 
works.

A simpler solution altogether may be to not cancel the token 
retransmit timer on receipt of a regular message.  I can see no good 
reason to cancel that timer, other than as a micro-optimization (at the 
expense of the comparisons for checking the seqid and ringid - looks 
like a wash).

The patch you submitted doesn't handle rollover of the token seq (i.e. 
when the sequence number wraps around at the integer boundary).
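
As a rough illustration of what rollover handling could look like, here is
a minimal sketch in C. It assumes 32-bit unsigned sequence numbers and
treats one value as newer than another when it is ahead by less than half
the number space. The helper names seq_newer() and
should_cancel_token_retransmit() are hypothetical; they are not taken from
the corosync source or from the submitted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if sequence number a is strictly
 * newer than b, treating the 32-bit space as circular. Assumes the two
 * values are never more than 2^31 - 1 apart.
 */
static bool seq_newer(uint32_t a, uint32_t b)
{
    uint32_t d = a - b;            /* unsigned subtraction wraps modulo 2^32 */
    return d != 0 && d < 0x80000000u;
}

/*
 * Hypothetical decision function: only treat the last token as received
 * (and so stop retransmitting it) when the incoming mcast message has a
 * seq newer than the seq carried by the last token this node sent.
 */
static bool should_cancel_token_retransmit(uint32_t mcast_seq,
                                           uint32_t last_token_seq)
{
    return seq_newer(mcast_seq, last_token_seq);
}
```

With this comparison, a mcast seq of 1 counts as newer than a token seq of
0xFFFFFFFF, which a plain ">" would get wrong.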

Regards
-steve

> Thanks,
> Tim
>
>
>
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais

