Hi Steve,

I added and changed some log messages, so my log won't match the source tree. Anyway, I think I found the problem. This issue seems to occur easily when multicast messages are sent infrequently. The problem is that the rtr field is filled based on my_high_seq_received. It should be set based on the token->seq value.
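To make the difference concrete, here is a minimal standalone sketch of the two range computations. It is not the actual totemsrp.c code: the struct node_state and both function names are hypothetical, and only the field names my_aru and my_high_seq_received and the token seq value are taken from the discussion.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-node state.
 * This is NOT the real instance struct from totemsrp.c. */
struct node_state {
    uint32_t my_aru;               /* highest seq received with no gaps below it */
    uint32_t my_high_seq_received; /* highest seq seen in any received message */
};

/* Range as currently computed: depends only on what this node has seen. */
static uint32_t rtr_range_current(const struct node_state *n)
{
    return n->my_high_seq_received - n->my_aru;
}

/* Range as proposed: depends on the seq carried by the token, so a node
 * learns about messages it never received at all. */
static uint32_t rtr_range_proposed(const struct node_state *n, uint32_t token_seq)
{
    return token_seq - n->my_aru;
}
```

For a node with my_aru = 76 and my_high_seq_received = 76 that receives a token with seq = 77, the first function returns 0 (nothing to request) while the second returns 1 (one missing message to request).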
Let's assume a very simple case: just one mcast message (seq 77) was lost on node2.

In node1, all messages are received up to 77:

    token seq = 77
    my_aru = 77
    my_high_seq_received = 77

In node2, message 77 was lost:

    my_aru = 76
    token seq = 77
    my_high_seq_received = 76

Once node2 gets into this state, it does not set the rtr field for the lost message 77. Then my_aru_count keeps increasing, and corosync enters "FAILED TO RECEIVE" and then gather.

The totem spec says clearly that if the token seq is greater than my_aru, this processor has lost some messages and should set the rtr field to request retransmission.

The related code is in orf_token_rtr() in totemsrp.c:

    range = instance->my_high_seq_received - instance->my_aru;

The above line should be changed to:

    range = orf_token->seq - instance->my_aru;

What was the reason for introducing my_high_seq_received? The original spec does not have this variable.

Thanks
hj

On Fri, Mar 19, 2010 at 9:59 AM, Steven Dake <[email protected]> wrote:
> can you please attach the logs from the last configuration change until
> the failure?
>
> It would really help me understand the condition so i can generate a
> reproducer.
>
> Thanks
> -steve
