Re: [ovs-dev] [PATCH] reconnect: Fix broken inactivity probe if there is no other reason to wake up.

Ilya Maximets Thu, 24 Feb 2022 11:52:53 -0800

On 2/22/22 17:00, Dumitru Ceara wrote:
> On 2/22/22 16:45, Ilya Maximets wrote:
>> On 2/22/22 13:05, Dumitru Ceara wrote:
>>> On 2/21/22 15:16, Ilya Maximets wrote:
>>>> The purpose of reconnect_deadline__() function is twofold:
>>>>
>>>> 1. Its result is used to tell if the state has to be changed right now
>>>>    in reconnect_run().
>>>> 2. Its result also used to determine when the process need to wake up
>>>>    and call reconnect_run() for a next time, i.e. when the state may
>>>>    need to be changed next time.
>>>>
>>>> Since introduction of the 'receive-attempted' feature, the function
>>>> returns LLONG_MAX if the deadline is in the future.  That works for
>>>> the first case, but doesn't for the second one, because we don't
>>>> really know when we need to call reconnect_run().
>>>>
>>>> This is the problem for applications where jsonrpc connection is the
>>>> only source of wake ups, e.g. ovn-northd.  When the network goes down
>>>> silently, e.g. server looses IP address due to DHCP failure, ovn-northd
>>>> will sleep in the poll loop indefinitely after being told that it
>>>> doesn't need to call reconnect_run() (deadline == LLONG_MAX).
>>>>
>>>> Fixing that by actually returning the expected time if it is in the
>>>> future, so we will know when to wake up.  In order to keep the
>>>> 'receive-attempted' feature, returning 'now + 1' in case where the
>>>> time has already passed, but receive wasn't attempted.  That will
>>>> trigger a fast wake up, so the application will be able to attempt the
>>>> receive even if there was no real events.  In a correctly written
>>>> application we should not fall into this case more than once in a row.
>>>> '+ 1' ensures that we will not transition into a different state
>>>> prematurely, i.e. before the receive is actually attempted.
>>>>
>>>> Fixes: 4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long 
>>>> poll intervals.")
>>>> Signed-off-by: Ilya Maximets <[email protected]>
>>>> ---
>>>
>>> Hi Ilya,
>>>
>>> This change looks good to me, I just have a couple minor comment-related
>>> remarks; in any case:
>>>
>>> Acked-by: Dumitru Ceara <[email protected]>
>>>
>>> Regards,
>>> Dumitru
>>>
>>>>
>>>> CC: Wentao Jia
>>>> This might be the problem you tried to fix in northd by setting a probe
>>>> interval.  Would be great, if you can test your case with this patch.
>>>>
>>>>  lib/reconnect.c         |  29 +++++++++---
>>>>  python/ovs/reconnect.py |  40 +++++++++++-----
>>>>  tests/reconnect.at      | 102 +++++++++++++++++++++++++++++++---------
>>>>  3 files changed, 133 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>>>> index a929ddfd2..f6fd165e4 100644
>>>> --- a/lib/reconnect.c
>>>> +++ b/lib/reconnect.c
>>>> @@ -75,7 +75,8 @@ struct reconnect {
>>>>  
>>>>  static void reconnect_transition__(struct reconnect *, long long int now,
>>>>                                     enum state state);
>>>> -static long long int reconnect_deadline__(const struct reconnect *);
>>>> +static long long int reconnect_deadline__(const struct reconnect *,
>>>> +                                          long long int now);
>>>>  static bool reconnect_may_retry(struct reconnect *);
>>>>  
>>>>  static const char *
>>>> @@ -539,7 +540,7 @@ reconnect_transition__(struct reconnect *fsm, long 
>>>> long int now,
>>>>  }
>>>>  
>>>>  static long long int
>>>> -reconnect_deadline__(const struct reconnect *fsm)
>>>> +reconnect_deadline__(const struct reconnect *fsm, long long int now)
>>>>  {
>>>>      ovs_assert(fsm->state_entered != LLONG_MIN);
>>>>      switch (fsm->state) {
>>>> @@ -557,8 +558,18 @@ reconnect_deadline__(const struct reconnect *fsm)
>>>>          if (fsm->probe_interval) {
>>>>              long long int base = MAX(fsm->last_activity, 
>>>> fsm->state_entered);
>>>>              long long int expiration = base + fsm->probe_interval;
>>>> -            if (fsm->last_receive_attempt >= expiration) {
>>>> +            if (now < expiration || fsm->last_receive_attempt >= 
>>>> expiration) {
>>>> +                /* We still have time before the expiration or the time 
>>>> has
>>>> +                 * already passed and there was no activity.  In the 
>>>> first case
>>>> +                 * we need to wait for the expiration, in the second - 
>>>> we're
>>>> +                 * already past the deadline. */
>>>>                  return expiration;
>>>> +            } else {
>>>> +                /* Time has already passed, but we didn't attempt to 
>>>> receive
>>>> +                 * anything.  We need to wake up and try to receive even 
>>>> if
>>>> +                 * nothing is pending, so we can update the expiration 
>>>> time
>>>> +                 * or transition into S_IDLE state. */
>>>> +                return now + 1;
>>>>              }
>>>>          }
>>>>          return LLONG_MAX;
>>>> @@ -566,8 +577,14 @@ reconnect_deadline__(const struct reconnect *fsm)
>>>>      case S_IDLE:
>>>>          if (fsm->probe_interval) {
>>>>              long long int expiration = fsm->state_entered + 
>>>> fsm->probe_interval;
>>>> -            if (fsm->last_receive_attempt >= expiration) {
>>>> +            if (now < expiration || fsm->last_receive_attempt >= 
>>>> expiration) {
>>>
>>> Should we duplicate or add a similar comment to what we have for S_ACTIVE?
>>
>> I thought about that, but these 'if's are very close to each other and the
>> comment will be exactly the same, so I didn't want to just copy-paste it.
>>
>> Does that make sense?
>>
> 
> I see but then the second comment, for the 'else' in state S_IDLE, also
> seems very similar to the one in state S_ACTIVE.  Can we do the same
> thing for it too?  E.g., change the one in S_ACTIVE to:
> 
> "Time has already passed, but we didn't attempt to receive anything.  We
> need to wake up and try to receive even if nothing is pending, so we can
> update the expiration time or transition to a different state."
> 
> Disconnect happens on a state transition too so the comment should apply
> to both cases.
> 
> In any case, I'm OK with the original patch too, this is just nit
> picking. :)


I like this suggestion, so I implemented it and applied the patch.  Thanks!
Also backported down to 2.15.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH] reconnect: Fix broken inactivity probe if there is no other reason to wake up.

Reply via email to