On Fri, Mar 19, 2010 at 7:42 PM, Steven Dake <[email protected]> wrote:
> On Fri, 2010-03-19 at 16:45 -0600, hj lee wrote:
> > Hi Steve,
> >
> > I added and changed some log messages, so my log won't match the
> > source tree. Anyway, I think I found the problem. This issue seems to
> > happen easily when multicast messages are sent infrequently.
> > The problem is that the rtr field is filled based on
> > my_high_seq_received! It should be set based on the token->seq value.
> >
>
> I did notice this inconsistency and was thinking along these lines too,
> but I wanted to see your log to see if some other events were occurring
> related to oldring_state_save()/restore(). Another possibility is some
> sort of misfeeding to or from the regular/recovery queue. (i.e. do you
> have more than this retransmission bug?)
>

If I have time, I will try to post some logs. I have some issues in
recovery also. When corosync enters GATHER/RECOVERY mode for whatever
reason, there are cases where it just keeps looping through
GATHER/COMMIT/RECOVERY/OPERATIONAL again and again ... I haven't had
time to debug this kind of issue; right now I am just trying to prevent
whatever errors I can first.

> > Let's assume a very simple case: just one mcast message (seq 77) was
> > lost on node2.
> >
> > In node1:
> > all messages are received up to 77.
> > token seq = 77
> > my_aru = 77
> > my_high_seq_received = 77
> >
> > In node2:
> > message 77 was lost.
> > my_aru = 76
> > token seq = 77
> > my_high_seq_received = 76
> >
> >
> > Once node2 gets into this state, it does not set the rtr field for the
> > lost message 77. Then my_aru_count keeps increasing and corosync
> > enters "FAILED TO RECEIVE" and gather. The totem spec says clearly that
> > if the token seq is greater than my_aru, this processor has lost some
> > messages and should set the rtr field to request retransmission.
> >
> > The related code is in orf_token_rtr() in totemsrp.c.
> >
> > range = instance->my_high_seq_received - instance->my_aru;
> >
> > The above line should be changed to
> >
> > range = orf_token->seq - instance->my_aru;
> >
>
> Ya good catch. More totem experts = win for the community ;)
>

Thank you very much. Anyway, this may trigger an extra retransmission of
an mcast message that is already on the way but hasn't been received
yet. So yesterday I was thinking of only using the token seq if
my_aru_count is greater than 5 or some number.

> > What was the reason for introducing my_high_seq_received? The original
> > spec does not have this variable.
> >
>
> It does indirectly. EVS requires that we continue to search through the
> list of messages in the sort queue when we lose a message during the
> entry to the operational state. As an example:
> n1, n2, n3
> n1 sends a seq = 53
> n2 sends b seq = 54
> n3 sends c seq = 55
> n1 receives a, c
> n3 receives a, c
> n2 fails
>
> In this case we should deliver A, then a transitional configuration,
> then skip any missing messages that could not be recovered (because they
> are from failed processors, specifically message b). Then we should
> deliver C. To deliver C, we need to know the last entry in the sort
> queue, and my_high_seq_received is how we know when we have reached that
> point (i.e. how to calculate the range of the queue).
>
>
> Good catch
> -steve
>

--
Peakpoint Service
Cluster Setup, Troubleshooting & Development
[email protected]
(303) 997-2823
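
A minimal sketch of the node2 case discussed above, in standalone C. It
is not the totemsrp.c code: struct node_state and the three rtr_range_*
functions are hypothetical names made up for illustration, and only the
field names my_aru, my_high_seq_received and my_aru_count come from the
thread. It compares the range computed from my_high_seq_received with
the range computed from the token seq, plus the threshold idea (fall
back to the token seq only once my_aru_count exceeds some limit).

```c
/* Hypothetical, simplified stand-in for the per-node state in the thread.
 * The real instance structure in totemsrp.c has many more fields. */
struct node_state {
    unsigned int my_aru;               /* every seq up to here was received */
    unsigned int my_high_seq_received; /* highest seq seen on any message   */
};

/* Range derived from the highest message seq this node has seen,
 * as in the line quoted from orf_token_rtr(). */
static unsigned int rtr_range_current(const struct node_state *n)
{
    return n->my_high_seq_received - n->my_aru;
}

/* Range derived from the seq carried by the token, as proposed. */
static unsigned int rtr_range_proposed(const struct node_state *n,
                                       unsigned int token_seq)
{
    return token_seq - n->my_aru;
}

/* Assumed variant of the threshold idea: trust the token seq only after
 * my_aru has been stuck for more than `threshold` token rotations, so a
 * message that is merely still in flight is not requested again. */
static unsigned int rtr_range_thresholded(const struct node_state *n,
                                          unsigned int token_seq,
                                          unsigned int my_aru_count,
                                          unsigned int threshold)
{
    if (my_aru_count > threshold) {
        return rtr_range_proposed(n, token_seq);
    }
    return rtr_range_current(n);
}
```

With node2's values (my_aru = 76, my_high_seq_received = 76, token seq =
77), rtr_range_current() gives 0, so message 77 is never requested, while
rtr_range_proposed() gives 1. The thresholded variant gives 0 until
my_aru_count passes the limit and 1 afterwards.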
