Hi Christine,

I have tested your patch, but it does not solve my problem. By adding printf calls, I found that retrans_message_queue is always empty, whether or not a retransmission occurs in my test case. It seems that retrans_message_queue is used only in the recovery state?

On Aug 5, 2014 3:50 PM, "Christine Caulfield" <[email protected]> wrote:
> On 01/08/14 10:50, Christine Caulfield wrote:
>
>> On 01/08/14 10:42, Jan Friesse wrote:
>>
>>> Jason,
>>>
>>>> Hi All,
>>>>
>>>> I have encountered a problem: when there is no other activity on the
>>>> ring but only retransmission, and the token is in hold mode, the
>>>> retransmission becomes slow. Moreover, if the retransmission always
>>>> fails but token
>>>
>>> Yes
>>>
>>>> rotation works well, then it takes quite a long time
>>>> (fail_to_recv_const * token_hold = 2500 * 180ms = 450sec) for the
>>>> retransmitting node to meet the "FAILED TO RECEIVE" condition and
>>>> re-construct a new ring. This can be reproduced by the following steps:
>>>>
>>>> 1) Create a two-node cluster in udpu transport mode.
>>>> 2) Wait until there is no other activity on the ring.
>>>> 3) One or both nodes delete each other from the nodelist in
>>>>    corosync.conf.
>>>> 4) Run corosync-cfgtool -R; this can cause a message retransmission,
>>>>    but I am not sure why.
>>>> 5) Since token rotation still works well, but the retransmission
>>>>    cannot be satisfied due to the node deletion, only the "FAILED TO
>>>>    RECEIVE" condition can form a new ring. But we need to wait 450
>>>>    seconds for it to happen. During this wait, we saw the following
>>>>    logs:
>>>
>>> This is a really weird case.
>>>
>>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>>> ...
>>>>
>>>> This problem can be solved by adding token_hold_cancel_send() in both
>>>> the retransmission request and response conditions in orf_token_rtr()
>>>> to speed up retransmission. I created a patch below, any comments?
>>>
>>> Ok. The patch looks fine, but during review I had another idea. What
>>> about prohibiting the start of hold mode when there are messages to
>>> retransmit? Such a solution may be cleaner, don't you think?
>>>
>>> Anyway. This is a change in a very critical part of the code, so
>>> Chrissie, can you please take a look at the patch and give your opinion?
>>
>> I've been looking it over since yesterday. It's a problem I have
>> definitely seen myself on some VM systems, so it's certainly not an
>> isolated case. I think Honza is right that there might be a better way
>> of fixing it, so I'll have a look.
>>
>> Chrissie
>
> Annoyingly, my common reproducer seems not to be working and I can't get
> yours to make it happen either. If you can still reproduce it, could you
> try this patch for me please?
>
> Chrissie
>
> _______________________________________________
> discuss mailing list
> [email protected]
> http://lists.corosync.org/mailman/listinfo/discuss
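
For reference, the 450-second figure in the original report follows if one failed retransmit attempt is counted per token hold interval. Below is a minimal sketch of that arithmetic only. The function name is made up for illustration, and the values 2500 and 180 ms are the ones quoted in the report, not read from a running corosync:

```python
def failed_to_receive_wait_seconds(fail_to_recv_const: int, token_hold_ms: int) -> float:
    """Estimate the wait before "FAILED TO RECEIVE" triggers.

    Assumption: exactly one failed retransmit attempt is counted per
    token hold interval, as described in the original report.
    """
    return fail_to_recv_const * token_hold_ms / 1000.0


if __name__ == "__main__":
    # Values quoted in the report: 2500 attempts, 180 ms hold interval.
    wait = failed_to_receive_wait_seconds(2500, 180)
    print(f"{wait:.0f} s (~{wait / 60:.1f} min)")
```

With the quoted values this gives 450 seconds, about 7.5 minutes. Any change that lets retransmission proceed without waiting out the hold interval would shrink the second factor, which is why both proposed fixes target hold mode.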
