On Sat, 2010-03-20 at 12:46 -0600, hj lee wrote:
>
> On Fri, Mar 19, 2010 at 7:42 PM, Steven Dake <[email protected]> wrote:
>         On Fri, 2010-03-19 at 16:45 -0600, hj lee wrote:
>         > Hi Steve,
>         >
>         > I added and changed some log messages, so my log won't match
>         > the source tree. Anyway, I think I found the problem. This
>         > issue seems to happen easily when multicast messages are sent
>         > infrequently. The problem is that the rtr field is filled
>         > based on my_high_seq_received! It should be set based on the
>         > token->seq value.
>
>         I did notice this inconsistency and was thinking along these
>         lines too, but I wanted to see your log to see if some other
>         events were occurring related to oldring_state_save()/restore().
>         Another possibility is some sort of misfeeding to or from the
>         regular/recovery queue. (i.e., do you have more than this
>         retransmission bug?)
>
> If I have time, I will try to post some logs. I have some issues in
> recovery also. When corosync enters GATHER/RECOVERY mode for whatever
> reason, there are cases where it just keeps looping through
> GATHER/COMMIT/RECOVERY/OPERATIONAL again and again ... I haven't had
> the time to debug this kind of issue; right now I am just trying to
> prevent whatever errors I can first.
>
>         > Let's assume a very simple case: just one mcast message
>         > (seq 77) was lost in node2.
>         >
>         > In node1:
>         > all messages are received up to 77.
>         > token seq = 77
>         > my_aru = 77
>         > my_high_seq_received = 77
>         >
>         > In node2:
>         > message 77 was lost.
>         > my_aru = 76
>         > token seq = 77
>         > my_high_seq_received = 76
>         >
>         > Once node2 gets into this state, it does not set the rtr field
>         > for the lost message 77. Then my_aru_count keeps increasing and
>         > corosync enters "FAILED TO RECEIVE" and gather. The totem spec
>         > says clearly that if the token seq is greater than my_aru, this
>         > processor has lost some messages and should set the rtr field
>         > to request retransmission.
>         >
>         > The related code is in orf_token_rtr() at totemsrp.c.
>         >
>         > range = instance->my_high_seq_received - instance->my_aru;
>         >
>         > Above line should be changed to
>         >
>         > range = orf_token->seq - instance->my_aru;
>
>         Ya good catch. More totem experts = win for the community ;)
>
> Thank you very much. Anyway, this may generate an extra mcast of a
> message that is already on the way but hasn't been received yet. So
> yesterday I was thinking of using the token seq only if my_aru_count
> is greater than 5 or some such number.
>
I don't believe that is the case. The only time this problem really
occurs is when the last message was not received by a particular
processor. In that case, range is short by the number of trailing
messages that were not received in a row, which triggers the fail to
recv state.

On receipt of the token, we flush the file descriptor related to mcast
recv messages to ensure the internal state of totem is up to date with
regard to every piece of information we have available.

Even if an extra retransmission were requested, it would happen only
during recovery of lost messages, which is rare in modern LAN
environments. So rare that this bug has been in the totem code for 8
years undetected...

Regards
-steve
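
For reference, the range change discussed in this thread can be illustrated with a minimal standalone sketch. This is not the actual totemsrp.c code: the struct and function names (instance_sketch, token_sketch, rtr_range_old, rtr_range_new, print_rtr_requests) are hypothetical, and only the field names my_aru, my_high_seq_received and seq are taken from the thread. Sequence numbers are assumed to be plain unsigned integers.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real totemsrp.c structures.
 * Only the fields named in this thread are modelled. */
struct instance_sketch {
    unsigned int my_aru;               /* highest seq received with no gaps */
    unsigned int my_high_seq_received; /* highest seq seen on the wire */
};

struct token_sketch {
    unsigned int seq;                  /* highest seq stamped on the token */
};

/* Range as computed before the proposed change. */
unsigned int rtr_range_old(const struct instance_sketch *instance)
{
    return instance->my_high_seq_received - instance->my_aru;
}

/* Range as proposed in the thread: measured against the token's seq. */
unsigned int rtr_range_new(const struct instance_sketch *instance,
                           const struct token_sketch *orf_token)
{
    return orf_token->seq - instance->my_aru;
}

/* Print the sequence numbers a processor would ask to have resent
 * for a given range (assumption: every seq after my_aru is missing). */
void print_rtr_requests(const struct instance_sketch *instance,
                        unsigned int range)
{
    unsigned int i;

    for (i = 1; i <= range; i++) {
        printf("request retransmit of seq %u\n", instance->my_aru + i);
    }
}
```

With the node2 values from the example (my_aru = 76, my_high_seq_received = 76, token seq = 77), the old expression yields a range of 0 and the token-based expression yields 1, so only the latter would request seq 77.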
