Re: [Openais] [Corosync] Corosync does not retransmit the lost mcast message

hj lee Sat, 20 Mar 2010 11:49:30 -0700

On Fri, Mar 19, 2010 at 7:42 PM, Steven Dake <[email protected]> wrote:


> On Fri, 2010-03-19 at 16:45 -0600, hj lee wrote:
> > Hi Steve,
> >
> > I added and changed some log messages, so my log won't match the
> > source tree. Anyway, I think I found the problem. This issue seems to
> > happen easily when multicast messages are sent infrequently.
> > The problem is that the rtr field is filled based on my_high_seq_received!
> > It should be set based on the token->seq value.
> >
>
> I did notice this inconsistency and was thinking along these lines too,
> but I wanted to see your log to see if some other events were occurring
> related to oldring_state_save()/restore().  Another possibility is some
> sort of misfeeding to or from the regular/recovery queue. (i.e., do you
> have more than this retransmission bug?)
>

If I have time, I will try to post some logs. I have some issues in
recovery too. When corosync enters GATHER/RECOVERY mode for whatever
reason, there are cases where it just keeps looping through
GATHER/COMMIT/RECOVERY/OPERATIONAL again and again. I haven't had the
time to debug this kind of issue; right now I am just trying to prevent
the errors that trigger it in the first place.

>
> > Let's assume a very simple case: just one mcast message (seq 77) was
> > lost on node2.
> >
> > In node1:
> > all messages are received up to 77.
> > token seq = 77
> > my_aru = 77
> > my_high_seq_received = 77
> >
> > in node2:
> > message 77 was lost.
> > my_aru = 76
> > token seq = 77
> > my_high_seq_received = 76
> >
> >
> > Once node2 gets into this state, it does not set the rtr field for the
> > lost message 77. Then my_aru_count keeps increasing and corosync
> > enters "FAILED TO RECEIVE" and gather. The totem spec says clearly that
> > if the token seq is greater than my_aru, this processor has lost some
> > messages and should set the rtr field to request retransmission.
> >
> > The related code is in orf_token_rtr() at totemsrp.c.
> >
> > range = instance->my_high_seq_received - instance->my_aru;
> >
> > Above line should be changed to
> >
> > range = orf_token->seq - instance->my_aru;
> >
>
> Ya good catch.  More totem experts = win for the community ;)
>

Thank you very much. However, this change may request retransmission of an
mcast message that is already on the way but hasn't been received yet, which
generates an extra copy. So yesterday I was thinking of using the token seq
only if my_aru_count is greater than 5 or some other threshold.
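
To make the idea concrete, here is a rough sketch of the two range
calculations side by side. It is not the actual orf_token_rtr() code: the
struct, the function names and the threshold value are all made up for
illustration, and only the field names come from totemsrp.c.

```c
/* Simplified stand-in for the fields discussed in this thread.
 * This is NOT the real totemsrp instance struct. */
struct node_state {
    unsigned int my_aru;               /* highest seq received with no gaps */
    unsigned int my_high_seq_received; /* highest seq seen in any received mcast */
    unsigned int my_aru_count;         /* simplified: rotations with my_aru stuck */
};

/* Hypothetical threshold (the "5 or some number" mentioned above). */
#define ARU_STUCK_THRESHOLD 5u

/* Current behaviour: the range ends at the highest message actually received. */
static unsigned int rtr_range_current(const struct node_state *s)
{
    return s->my_high_seq_received - s->my_aru;
}

/* Guarded variant: use the token seq only once my_aru has been stuck
 * for more than the threshold, otherwise keep the current behaviour. */
static unsigned int rtr_range_guarded(const struct node_state *s,
                                      unsigned int token_seq)
{
    if (s->my_aru_count > ARU_STUCK_THRESHOLD) {
        return token_seq - s->my_aru;
    }
    return rtr_range_current(s);
}
```

With node2's state from the example (my_aru = 76, my_high_seq_received = 76,
token seq = 77), the current calculation gives a range of 0, so message 77 is
never requested. The guarded variant gives 1 once my_aru_count passes the
threshold.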

>
> > What was the reason for introducing my_high_seq_received? The original
> > spec does not have this variable.
> >
>
> It does indirectly.  EVS requires that we continue to search through the
> list of messages in the sort queue when we lose a message during the
> entry to the operational state.  As an example:
> n1, n2, n3
> n1 sends a seq = 53
> n2 sends b seq = 54
> n3 sends c seq = 55
> n1 receives a, c
> n3 receives a, c
> n2 fails
>
> In this case we should deliver A, then a transitional configuration,
> then skip any messages missing that could not be recovered (because they
> are from failed processors specifically message b). Then we should
> deliver C.  To deliver C, we need to know the last entry in the sort
> queue, and my_high_seq_received is how we know when we have reached that
> point (ie how to calculate the range of the queue).
>
>
> Good catch
> -steve
>
>
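As I understand the explanation above, the delivery after the transitional
configuration could be sketched roughly as below. The slot struct, the
function name and the flat array are assumptions for illustration only, not
the real sort queue in corosync, and sequence number wraparound is ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sort queue slot; the real queue in corosync looks different. */
struct sort_slot {
    bool present;  /* false if the message was never received */
    char payload;  /* stand-in for the message body */
};

/* Deliver every message in (my_aru, my_high_seq_received] that is present,
 * skipping gaps left by failed processors. queue[0] holds seq == base_seq.
 * Returns the number of payloads written to out. */
static size_t deliver_past_gaps(const struct sort_slot *queue, size_t slots,
                                unsigned int base_seq, unsigned int my_aru,
                                unsigned int my_high_seq_received, char *out)
{
    size_t delivered = 0;

    for (unsigned int seq = my_aru + 1; seq <= my_high_seq_received; seq++) {
        size_t idx = seq - base_seq;

        if (idx >= slots || !queue[idx].present) {
            continue; /* unrecoverable message, e.g. b from the failed n2 */
        }
        out[delivered++] = queue[idx].payload;
    }
    return delivered;
}
```

For n1 in the example, a (seq 53) is delivered before the transitional
configuration, so my_aru = 53 and my_high_seq_received = 55. The loop then
skips the missing seq 54 and delivers only c. Without my_high_seq_received
there would be no upper bound telling the loop where the queue ends.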
-- 
Peakpoint Service

Cluster Setup, Troubleshooting & Development
[email protected]
(303) 997-2823

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
