Re: [Openais] [Corosync] Corosync does not retransmit the lost mcast message

hj lee Fri, 19 Mar 2010 15:48:06 -0700

Hi Steve,

I added and changed some log messages, so my log won't match the source
tree. Anyway, I think I found the problem. The issue seems to occur most
easily when multicast messages are sent infrequently. The problem is that the
rtr field is filled based on my_high_seq_received! It should be set based on
the token->seq value.

Let's assume a very simple case: just one mcast message (seq 77) was lost on
node2.

In node1:
all messages are received up to 77.
token seq = 77
my_aru = 77
my_high_seq_received = 77

In node2:
message 77 was lost.
my_aru = 76
token seq = 77
my_high_seq_received = 76

Once node2 gets into this state, it does not set the rtr field for the lost
message 77. Then my_aru_count keeps increasing and corosync enters
"FAILED TO RECEIVE" and then gather. The Totem spec says clearly that if the
token seq is greater than my_aru, this processor has lost some messages and
should set the rtr field to request retransmission.
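
To make the rule concrete, here is a minimal sketch of it in C. The helper
rtr_candidates() is hypothetical and is not part of corosync; it assumes
plain unsigned sequence numbers with no wraparound, and it does not check
which candidates are already queued locally, which the real code would
have to do.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not corosync code: list the sequence numbers in
 * (my_aru, token_seq] as retransmission candidates.
 * Assumptions: no sequence wraparound, and no check of which messages
 * are already held locally.
 */
static size_t rtr_candidates(unsigned int my_aru, unsigned int token_seq,
                             unsigned int *out, size_t out_cap)
{
    size_t n = 0;
    unsigned int seq;

    assert(out != NULL || out_cap == 0);

    /* Nothing is missing if the token is not ahead of my_aru. */
    if (token_seq <= my_aru) {
        return 0;
    }

    /* Every seq above my_aru, up to the token seq, is a candidate. */
    for (seq = my_aru + 1; seq <= token_seq && n < out_cap; seq++) {
        out[n++] = seq;
    }
    return n;
}
```

With the node2 values above (my_aru = 76, token seq = 77) this yields one
candidate, seq 77. With the node1 values (my_aru = 77, token seq = 77) it
yields none.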

The related code is in orf_token_rtr() in totemsrp.c:

range = instance->my_high_seq_received - instance->my_aru;

The above line should be changed to:

range = orf_token->seq - instance->my_aru;

What was the reason for introducing my_high_seq_received? The original spec
does not have this variable.

Thanks
hj

On Fri, Mar 19, 2010 at 9:59 AM, Steven Dake <[email protected]> wrote:

> can you please attach the logs from the last configuration change until
> the failure?
>
> It would really help me understand the condition so i can generate a
> reproducer.
>
> Thanks
> -steve
>
>

