On 2/22/22 17:00, Dumitru Ceara wrote:
> On 2/22/22 16:45, Ilya Maximets wrote:
>> On 2/22/22 13:05, Dumitru Ceara wrote:
>>> On 2/21/22 15:16, Ilya Maximets wrote:
>>>> The purpose of reconnect_deadline__() function is twofold:
>>>>
>>>> 1. Its result is used to tell if the state has to be changed right now
>>>> in reconnect_run().
>>>> 2. Its result also used to determine when the process need to wake up
>>>> and call reconnect_run() for a next time, i.e. when the state may
>>>> need to be changed next time.
>>>>
>>>> Since introduction of the 'receive-attempted' feature, the function
>>>> returns LLONG_MAX if the deadline is in the future. That works for
>>>> the first case, but doesn't for the second one, because we don't
>>>> really know when we need to call reconnect_run().
>>>>
>>>> This is the problem for applications where jsonrpc connection is the
>>>> only source of wake ups, e.g. ovn-northd. When the network goes down
>>>> silently, e.g. server looses IP address due to DHCP failure, ovn-northd
>>>> will sleep in the poll loop indefinitely after being told that it
>>>> doesn't need to call reconnect_run() (deadline == LLONG_MAX).
>>>>
>>>> Fixing that by actually returning the expected time if it is in the
>>>> future, so we will know when to wake up. In order to keep the
>>>> 'receive-attempted' feature, returning 'now + 1' in case where the
>>>> time has already passed, but receive wasn't attempted. That will
>>>> trigger a fast wake up, so the application will be able to attempt the
>>>> receive even if there was no real events. In a correctly written
>>>> application we should not fall into this case more than once in a row.
>>>> '+ 1' ensures that we will not transition into a different state
>>>> prematurely, i.e. before the receive is actually attempted.
>>>>
>>>> Fixes: 4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long
>>>> poll intervals.")
>>>> Signed-off-by: Ilya Maximets <[email protected]>
>>>> ---
>>>
>>> Hi Ilya,
>>>
>>> This change looks good to me, I just have a couple minor comment-related
>>> remarks; in any case:
>>>
>>> Acked-by: Dumitru Ceara <[email protected]>
>>>
>>> Regards,
>>> Dumitru
>>>
>>>>
>>>> CC: Wentao Jia
>>>> This might be the problem you tried to fix in northd by setting a probe
>>>> interval. Would be great, if you can test your case with this patch.
>>>>
>>>> lib/reconnect.c | 29 +++++++++---
>>>> python/ovs/reconnect.py | 40 +++++++++++-----
>>>> tests/reconnect.at | 102 +++++++++++++++++++++++++++++++---------
>>>> 3 files changed, 133 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>>>> index a929ddfd2..f6fd165e4 100644
>>>> --- a/lib/reconnect.c
>>>> +++ b/lib/reconnect.c
>>>> @@ -75,7 +75,8 @@ struct reconnect {
>>>>
>>>> static void reconnect_transition__(struct reconnect *, long long int now,
>>>> enum state state);
>>>> -static long long int reconnect_deadline__(const struct reconnect *);
>>>> +static long long int reconnect_deadline__(const struct reconnect *,
>>>> + long long int now);
>>>> static bool reconnect_may_retry(struct reconnect *);
>>>>
>>>> static const char *
>>>> @@ -539,7 +540,7 @@ reconnect_transition__(struct reconnect *fsm, long
>>>> long int now,
>>>> }
>>>>
>>>> static long long int
>>>> -reconnect_deadline__(const struct reconnect *fsm)
>>>> +reconnect_deadline__(const struct reconnect *fsm, long long int now)
>>>> {
>>>> ovs_assert(fsm->state_entered != LLONG_MIN);
>>>> switch (fsm->state) {
>>>> @@ -557,8 +558,18 @@ reconnect_deadline__(const struct reconnect *fsm)
>>>> if (fsm->probe_interval) {
>>>> long long int base = MAX(fsm->last_activity,
>>>> fsm->state_entered);
>>>> long long int expiration = base + fsm->probe_interval;
>>>> - if (fsm->last_receive_attempt >= expiration) {
>>>> + if (now < expiration || fsm->last_receive_attempt >=
>>>> expiration) {
>>>> + /* We still have time before the expiration or the time
>>>> has
>>>> + * already passed and there was no activity. In the
>>>> first case
>>>> + * we need to wait for the expiration, in the second -
>>>> we're
>>>> + * already past the deadline. */
>>>> return expiration;
>>>> + } else {
>>>> + /* Time has already passed, but we didn't attempt to
>>>> receive
>>>> + * anything. We need to wake up and try to receive even
>>>> if
>>>> + * nothing is pending, so we can update the expiration
>>>> time
>>>> + * or transition into S_IDLE state. */
>>>> + return now + 1;
>>>> }
>>>> }
>>>> return LLONG_MAX;
>>>> @@ -566,8 +577,14 @@ reconnect_deadline__(const struct reconnect *fsm)
>>>> case S_IDLE:
>>>> if (fsm->probe_interval) {
>>>> long long int expiration = fsm->state_entered +
>>>> fsm->probe_interval;
>>>> - if (fsm->last_receive_attempt >= expiration) {
>>>> + if (now < expiration || fsm->last_receive_attempt >=
>>>> expiration) {
>>>
>>> Should we duplicate or add a similar comment to what we have for S_ACTIVE?
>>
>> I thought about that, but these 'if's are very close to each other and the
>> comment will be exactly the same, so I didn't want to just copy-paste it.
>>
>> Does that make sense?
>>
>
> I see but then the second comment, for the 'else' in state S_IDLE, also
> seems very similar to the one in state S_ACTIVE. Can we do the same
> thing for it too? E.g., change the one in S_ACTIVE to:
>
> "Time has already passed, but we didn't attempt to receive anything. We
> need to wake up and try to receive even if nothing is pending, so we can
> update the expiration time or transition to a different state."
>
> Disconnect happens on a state transition too so the comment should apply
> to both cases.
>
> In any case, I'm OK with the original patch too, this is just nit
> picking. :)
I like this suggestion, so I implemented it and applied the patch. Thanks!
Also backported down to 2.15.
Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev