On Fri, Aug 1, 2014 at 2:50 AM, Christine Caulfield <[email protected]> wrote:
> On 01/08/14 10:42, Jan Friesse wrote:
>
>> Jason,
>>
>>> Hi All,
>>>
>>> I have encountered a problem: when there is no other activity on the
>>> ring but only retransmission, and the token is in hold mode, the
>>> retransmission becomes slow. Moreover, if the retransmission always
>>> fails but token
>>
>> Yes
>>
>>> rotation works well, then it takes quite a long time
>>> (fail_to_recv_const * token_hold = 2500 * 180ms = 450sec) for the
>>> retransmitting node to meet the "FAILED TO RECEIVE" condition and
>>> re-construct a new ring. This can be reproduced by the following steps:
>>>
>>> 1) Create a two-node cluster in udpu transport mode.
>>> 2) Wait until there is no other activity on the ring.
>>> 3) One, or both nodes delete each other in nodelist in corosync.conf
>>> 4) corosync-cfgtool -R, this can cause a message retransmission,
>>>    but I am not sure why.
>>> 5) Since token rotation still works well, but the retransmission
>>>    cannot be satisfied due to the node deletion, only the "FAILED TO
>>>    RECEIVE" condition can form a new ring. But we need to wait 450
>>>    seconds for it to happen. During this wait, we saw the following
>>>    logs:
>>
>> This is a really weird case.
>>
>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>> Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>> ...
>>>
>>> This problem can be solved by adding token_hold_cancel_send() in both
>>> the retransmission request and response conditions in orf_token_rtr()
>>> to speed up retransmission. I created a patch below, any comments?
>>
>> Ok. Patch looks fine, but during review I had another idea. What about
>> prohibiting the start of hold mode when there are messages to
>> retransmit? Such a solution may be cleaner, don't you think?

This seems better to me - pragmatically speaking it prevents the scenario
where a hold cancel message is lost via UDP and prevents extra
complication with only one more conditional.

>> Anyway. This is a change in a very critical part of the code, so
>> Chrissie, can you please take a look at the patch and express your
>> opinion?
>
> I was looking it over yesterday. It's a problem I have definitely seen
> myself on some VM systems so it's certainly not an isolated case. I think
> Honza is right that there might be a better way of fixing it so I'll have
> a look.

The patch will work, but I think Honza's approach is better.

Regards,
-steve

> Chrissie
> should
>
>> Regards,
>> Honza
>>
>>> Signed-off-by: Jason HU <[email protected]>
>>>
>>> ------------------------------- exec/totemsrp.c -------------------------------
>>> index dcda8d1..c227c44 100644
>>> @@ -2672,6 +2672,7 @@ static int orf_token_rtr (
>>>
>>>  	strcpy (retransmit_msg, "Retransmit List: ");
>>>  	if (orf_token->rtr_list_entries) {
>>> +		token_hold_cancel_send(instance);
>>>  		log_printf (instance->totemsrp_log_level_debug,
>>>  			"Retransmit List %d", orf_token->rtr_list_entries);
>>>  		for (i = 0; i < orf_token->rtr_list_entries; i++) {
>>> @@ -2726,6 +2727,10 @@ static int orf_token_rtr (
>>>  	range = orf_token->seq - instance->my_aru;
>>>  	assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>
>>> +	if (range >= 1) {
>>> +		token_hold_cancel_send(instance);
>>> +	}
>>> +
>>>  	for (i = 1; (orf_token->rtr_list_entries < RETRANSMIT_ENTRIES_MAX) &&
>>>  		(i <= range); i++) {
>>>
>
> _______________________________________________
> discuss mailing list
> [email protected]
> http://lists.corosync.org/mailman/listinfo/discuss
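
To make Honza's suggestion concrete, here is a minimal standalone sketch of
"do not start hold mode while there is something to retransmit". It is not
the real totemsrp.c code: struct ring_state, retransmit_pending(),
may_enter_hold() and failed_to_recv_wait_ms() are hypothetical names made up
for illustration, and the idle-rotation threshold is only an assumption
about how the hold decision is made. The last helper just reproduces the
450 second figure from Jason's report.

```c
#include <stdbool.h>

/* Hypothetical, simplified state for the hold decision.
 * These are NOT the real totemsrp.c structures. */
struct ring_state {
	unsigned int seq_unchanged;         /* token rotations with no new messages */
	unsigned int seqno_unchanged_const; /* idle rotations needed before holding (assumed) */
	unsigned int rtr_list_entries;      /* retransmit requests carried by the token */
	unsigned int my_aru;                /* highest seq this node received in order */
	unsigned int token_seq;             /* highest seq announced by the token */
};

/* True if the token carries retransmit requests, or this node is itself
 * missing messages (the same "range >= 1" test as in the patch above). */
static bool retransmit_pending(const struct ring_state *s)
{
	unsigned int range = s->token_seq - s->my_aru;

	return s->rtr_list_entries > 0 || range >= 1;
}

/* The "one more conditional": never enter hold mode while a retransmit
 * is outstanding, otherwise hold once the ring has been idle long enough. */
static bool may_enter_hold(const struct ring_state *s)
{
	if (retransmit_pending(s)) {
		return false;
	}
	return s->seq_unchanged > s->seqno_unchanged_const;
}

/* Wait (in ms) before "FAILED TO RECEIVE" triggers if every failed
 * rotation is delayed by the hold timeout, as described in the report. */
static unsigned long failed_to_recv_wait_ms(unsigned long fail_to_recv_const,
					     unsigned long token_hold_ms)
{
	return fail_to_recv_const * token_hold_ms;
}
```

With the values from the report, failed_to_recv_wait_ms(2500, 180) gives
450000 ms, i.e. the 450 seconds observed. Compared with sending a hold
cancel message, this kind of check does not depend on an extra UDP message
arriving.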
