On 6/29/21 1:20 PM, Dumitru Ceara wrote:
> Until now clients that needed to reconnect immediately could only use
> reconnect_force_reconnect().  However, reconnect_force_reconnect()
> doesn't reset the backoff for connections that were alive long enough
> (more than backoff seconds).
> 
> Moreover, the reconnect library cannot determine the exact reason why a
> client wishes to initiate a reconnection.  In most cases reconnection
> happens because of a fatal error when communicating with the remote,
> e.g., in the ovsdb-cs layer, when invalid messages are received from
> ovsdb-server.  In such cases it makes sense to not reset the backoff
> because the remote seems to be unhealthy.
> 
> There are however cases when reconnection is needed for other reasons.
> One such example is when ovsdb-clients require "leader-only" connections
> to clustered ovsdb-server databases.  Whenever the client determines
> that the remote is not a leader anymore, it decides to reconnect to a
> new remote from its list, searching for the new leader.  Using
> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
> not reset backoff even though the former leader is still likely in good
> shape.
> 
> Since 3c2d6274bcee ("raft: Transfer leadership before creating
> snapshots.") leadership changes inside the clustered database happen
> more often and therefore "leader-only" clients need to reconnect more
> often too.  Not resetting the backoff every time a leadership change
> happens will cause all reconnections to happen with the maximum backoff
> (8 seconds) resulting in significant latency.
> 
> This commit also updates the Python reconnect and IDL implementations
> and adds tests for force-reconnect and graceful-reconnect.
> 
> Reported-at: https://bugzilla.redhat.com/1977264
> Signed-off-by: Dumitru Ceara <[email protected]>
> ---

Hi, Dumitru.

Thanks for working on this issue.  I've seen it in practice while running
OVN tests, but I still don't quiet understand why it happens.  Could you,
please, describe how state transitioning work here for the ovsdb-idl case?

> +# Forcefully reconnect.
> +force-reconnect
> +  in RECONNECT for 0 ms (2000 ms backoff)
> +  1 successful connections out of 3 attempts, seqno 2
> +  disconnected
> +run
> +  should disconnect
> +connecting
> +  in CONNECTING for 0 ms (2000 ms backoff)

Especially this part seems wrong to me.  Because after 'should disconnect'
there should be 'disconnect' of 'connect-fail', but not 'connecting'.  We
literally should disconnect here, otherwise it's a violation of the reconnect
API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
by not calling reconnect_disconnectd() when it is required, or there is some
other bug that makes 'reconnect' module to jump over few states in a fsm.

The logical workflow for the force-reconnect, from what I see in the code
should be:

1. force-reconnect --> transition to S_RECONNECT
2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
3. disconnect -> check the state, update backoff and transition to S_BACKOFF
4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
5. connected ....

Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
maybe also #4.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to