On Thu, Apr 11, 2024 at 7:44 PM Ilya Maximets <[email protected]> wrote:
>
> Inactivity probe interval on RAFT connections depend on a value of the
> election timer.  However, the actual value is not known until the
> database snapshot with the RAFT information is received by a joining
> server.  New joining server is using a default 1 second until then.
>
> In case a new joining server is trying to join an existing cluster
> with a large database, it may take more than a second to generate and
> send an initial database snapshot.  This is causing an inability to
> actually join this cluster.  Joining server sends ADD_SERVER request,
> waits 1 second, sends a probe, doesn't get a reply within another
> second, because the leader is busy preparing and sending an initial
> snapshot to it, disconnects, repeat.
>
> This is not an issue for the servers that did already join, since
> their probe intervals are larger than election timeout.
> Cooperative multitasking also doesn't fully solve this issue, since
> it depends on election timer, which is likely higher in the existing
> cluster with a very big database.
>
> Fix that by using the maximum election timer value for inactivity
> probes until the actual value is known.  We still shouldn't completely
> disable the probes, because in the rare event the connection is
> established but the other side silently goes away, we still want to
> disconnect and try to re-establish the connection eventually.
>
> Since probe intervals also depend on the joining state now, update
> them when the server joins the cluster.
>
> Fixes: 14b2b0aad7ae ("raft: Reintroduce jsonrpc inactivity probes.")
> Reported-by: Terry Wilson <[email protected]>
> Reported-at: https://issues.redhat.com/browse/FDP-144
> Signed-off-by: Ilya Maximets <[email protected]>
> ---

Acked-by: Mike Pattrick <[email protected]>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to