On 2/22/22 13:05, Dumitru Ceara wrote:
> On 2/21/22 15:16, Ilya Maximets wrote:
>> The purpose of reconnect_deadline__() function is twofold:
>>
>> 1. Its result is used to tell if the state has to be changed right now
>> in reconnect_run().
>> 2. Its result also used to determine when the process need to wake up
>> and call reconnect_run() for a next time, i.e. when the state may
>> need to be changed next time.
>>
>> Since introduction of the 'receive-attempted' feature, the function
>> returns LLONG_MAX if the deadline is in the future. That works for
>> the first case, but doesn't for the second one, because we don't
>> really know when we need to call reconnect_run().
>>
>> This is the problem for applications where jsonrpc connection is the
>> only source of wake ups, e.g. ovn-northd. When the network goes down
>> silently, e.g. server looses IP address due to DHCP failure, ovn-northd
>> will sleep in the poll loop indefinitely after being told that it
>> doesn't need to call reconnect_run() (deadline == LLONG_MAX).
>>
>> Fixing that by actually returning the expected time if it is in the
>> future, so we will know when to wake up. In order to keep the
>> 'receive-attempted' feature, returning 'now + 1' in case where the
>> time has already passed, but receive wasn't attempted. That will
>> trigger a fast wake up, so the application will be able to attempt the
>> receive even if there was no real events. In a correctly written
>> application we should not fall into this case more than once in a row.
>> '+ 1' ensures that we will not transition into a different state
>> prematurely, i.e. before the receive is actually attempted.
>>
>> Fixes: 4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long
>> poll intervals.")
>> Signed-off-by: Ilya Maximets <[email protected]>
>> ---
>
> Hi Ilya,
>
> This change looks good to me, I just have a couple minor comment-related
> remarks; in any case:
>
> Acked-by: Dumitru Ceara <[email protected]>
>
> Regards,
> Dumitru
>
>>
>> CC: Wentao Jia
>> This might be the problem you tried to fix in northd by setting a probe
>> interval. Would be great, if you can test your case with this patch.
>>
>> lib/reconnect.c | 29 +++++++++---
>> python/ovs/reconnect.py | 40 +++++++++++-----
>> tests/reconnect.at | 102 +++++++++++++++++++++++++++++++---------
>> 3 files changed, 133 insertions(+), 38 deletions(-)
>>
>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>> index a929ddfd2..f6fd165e4 100644
>> --- a/lib/reconnect.c
>> +++ b/lib/reconnect.c
>> @@ -75,7 +75,8 @@ struct reconnect {
>>
>> static void reconnect_transition__(struct reconnect *, long long int now,
>> enum state state);
>> -static long long int reconnect_deadline__(const struct reconnect *);
>> +static long long int reconnect_deadline__(const struct reconnect *,
>> + long long int now);
>> static bool reconnect_may_retry(struct reconnect *);
>>
>> static const char *
>> @@ -539,7 +540,7 @@ reconnect_transition__(struct reconnect *fsm, long long
>> int now,
>> }
>>
>> static long long int
>> -reconnect_deadline__(const struct reconnect *fsm)
>> +reconnect_deadline__(const struct reconnect *fsm, long long int now)
>> {
>> ovs_assert(fsm->state_entered != LLONG_MIN);
>> switch (fsm->state) {
>> @@ -557,8 +558,18 @@ reconnect_deadline__(const struct reconnect *fsm)
>> if (fsm->probe_interval) {
>> long long int base = MAX(fsm->last_activity,
>> fsm->state_entered);
>> long long int expiration = base + fsm->probe_interval;
>> - if (fsm->last_receive_attempt >= expiration) {
>> + if (now < expiration || fsm->last_receive_attempt >=
>> expiration) {
>> + /* We still have time before the expiration or the time has
>> + * already passed and there was no activity. In the first
>> case
>> + * we need to wait for the expiration, in the second - we're
>> + * already past the deadline. */
>> return expiration;
>> + } else {
>> + /* Time has already passed, but we didn't attempt to receive
>> + * anything. We need to wake up and try to receive even if
>> + * nothing is pending, so we can update the expiration time
>> + * or transition into S_IDLE state. */
>> + return now + 1;
>> }
>> }
>> return LLONG_MAX;
>> @@ -566,8 +577,14 @@ reconnect_deadline__(const struct reconnect *fsm)
>> case S_IDLE:
>> if (fsm->probe_interval) {
>> long long int expiration = fsm->state_entered +
>> fsm->probe_interval;
>> - if (fsm->last_receive_attempt >= expiration) {
>> + if (now < expiration || fsm->last_receive_attempt >=
>> expiration) {
>
> Should we duplicate or add a similar comment to what we have for S_ACTIVE?
I thought about that, but these 'if's are very close to each other and the
comment will be exactly the same, so I didn't want to just copy-paste it.
Does that make sense?
Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev