Re: [corosync] totemsrp: Cancel token holding while in retransmition

jason Tue, 05 Aug 2014 18:11:29 -0700

Sorry, the previous patch is wrong. Here is the correction.
On Aug 5, 2014 10:18 PM, "Christine Caulfield" <[email protected]> wrote:


> Hi Jason,
>
> Thanks for testing that - and the extra info. I'll have another think
> then. If I can't come up with anything more we might go with your patch.
>
> Chrissie
>
> On 05/08/14 13:01, jason wrote:
>
>> Hi Christine,
>> I have tested your patch, but it does not solve my problem. By adding
>> printf calls, I found that whether or not retransmission occurred in my
>> test case, the retrans_message_queue is always empty. It seems that the
>> retrans_message_queue is used only in the recovery state?
>>
>> On Aug 5, 2014 3:50 PM, "Christine Caulfield" <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     On 01/08/14 10:50, Christine Caulfield wrote:
>>
>>         On 01/08/14 10:42, Jan Friesse wrote:
>>
>>             Jason,
>>
>>
>>                 Hi All,
>>
>>                 I have encountered a problem: when there is no activity
>>                 on the ring other than retransmission, and the token is
>>                 in hold mode, retransmission becomes slow. Moreover, if
>>                 the retransmission always fails but token
>>
>>
>>             Yes
>>
>>                 rotation works well,
>>                 then it takes quite a long time (fail_to_recv_const *
>>                 token_hold = 2500 * 180ms = 450sec) for the
>>                 retransmitting node to meet the "FAILED TO RECEIVE"
>>                 condition and re-construct a new ring. This can be
>>                 reproduced by the following steps:
>>
>>                       1) Create a two-node cluster in udpu transport mode.
>>                       2) Wait until there is no other activity on the ring.
>>                       3) On one or both nodes, delete the other node from
>>                       the nodelist in corosync.conf.
>>                       4) Run corosync-cfgtool -R. This causes a message
>>                       retransmission, but I am not sure why.
>>                       5) Token rotation still works well, but the
>>                       retransmission cannot be satisfied because of the
>>                       node deletion, so only the "FAILED TO RECEIVE"
>>                       condition can form a new ring. But we need to wait
>>                       450 seconds for it to happen. During this wait,
>>                       we saw the following logs:
>>
>>
>>             This is a really weird case.
>>
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       ...
>>
>>
>>                 This problem can be solved by calling
>>                 token_hold_cancel_send() in both the retransmission
>>                 request and response branches of orf_token_rtr() to
>>                 speed up retransmission. I created the patch below; any
>>                 comments?
>>
>>
>>             OK. The patch looks fine, but during review I had another
>>             idea. What about prohibiting the start of hold mode when
>>             there are messages to retransmit? Such a solution might be
>>             cleaner, don't you think?
>>
>>             Anyway, this is a change in a very critical part of the
>>             code, so Chrissie, can you please take a look at the patch
>>             and give your opinion?
>>
>>
>>
>>         I was looking it over yesterday. It's a problem I have
>>         definitely seen myself on some VM systems, so it's certainly
>>         not an isolated case. I think Honza is right that there might
>>         be a better way of fixing it, so I'll have a look.
>>
>>         Chrissie
>>
>>
>>
>>     Annoyingly my common reproducer seems not to be working and I can't
>>     get yours to make it happen either. If you can still reproduce it
>>     could you try this patch for me please?
>>
>>     Chrissie
>>
>>
>>     _______________________________________________
>>     discuss mailing list
>>     [email protected] <mailto:[email protected]>
>>     http://lists.corosync.org/mailman/listinfo/discuss
>>
>>
>
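
For reference, the 450-second figure quoted in the thread can be checked with a small calculation. This is only a sketch: the helper name is hypothetical and not part of corosync, and it assumes, as the formula in the thread does, that the retransmitting node sees the token once per hold period (180 ms) and needs fail_to_recv_const (2500) such passes before it declares "FAILED TO RECEIVE".

```c
#include <assert.h>

/* Hypothetical helper, not part of corosync: estimated wait in
 * milliseconds before "FAILED TO RECEIVE" is reached when the token
 * is only seen once per hold period. */
static unsigned long failed_to_recv_wait_ms(unsigned long fail_to_recv_const,
                                            unsigned long token_hold_ms)
{
    /* A zero hold period would mean the token is not being held. */
    assert(token_hold_ms > 0);
    return fail_to_recv_const * token_hold_ms;
}
```

With the values from the thread, failed_to_recv_wait_ms(2500, 180) gives 450000 ms, i.e. 450 seconds.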

From dc2d0c2bc75492cada193909c2cd66fba4367d67 Mon Sep 17 00:00:00 2001
From: Jason HU <[email protected]>
Date: Wed, 6 Aug 2014 00:10:56 +0800
Subject: [PATCH] [totemsrp] Cancel token holding while in retransmition

---
 exec/totemsrp.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index dcda8d1..b603ef5 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -3650,6 +3650,12 @@ static int message_handler_orf_token (
 		transmits_allowed = fcc_calculate (instance, token);
 		mcasted_retransmit = orf_token_rtr (instance, token, &transmits_allowed);
 
+		if (instance->my_token_held == 1 &&
+			(token->rtr_list_entries > 0 || mcasted_retransmit > 0)) {
+			instance->my_token_held = 0;
+			forward_token = 1;
+		}
+
 		fcc_rtr_limit (instance, token, &transmits_allowed);
 		mcasted_regular = orf_token_mcast (instance, token, transmits_allowed);
 /*
-- 
1.9.4.msysgit.0
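
The condition added by the patch above can also be looked at in isolation. The following is a minimal standalone sketch, not the corosync code itself: struct hold_state and cancel_hold_if_retransmitting() are hypothetical names introduced for illustration, standing in for the my_token_held field, the forward_token local, token->rtr_list_entries, and mcasted_retransmit used in message_handler_orf_token().

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the two values the patch modifies; the real
 * structures in exec/totemsrp.c contain many more members. */
struct hold_state {
    int my_token_held;   /* 1 while this node is holding the token */
    int forward_token;   /* 1 if the token should be sent on now */
};

/* Hypothetical helper mirroring the condition added by the patch:
 * if the token is being held but there is retransmission work, either
 * requested in the token or just multicast by this node, stop holding
 * and forward the token immediately. */
static void cancel_hold_if_retransmitting(struct hold_state *st,
                                          int rtr_list_entries,
                                          int mcasted_retransmit)
{
    assert(st != NULL);
    if (st->my_token_held == 1 &&
        (rtr_list_entries > 0 || mcasted_retransmit > 0)) {
        st->my_token_held = 0;
        st->forward_token = 1;
    }
}
```

When there is nothing to retransmit, or the token is not being held, the helper leaves both values unchanged, so normal hold behaviour on an idle ring is not affected.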

