On Thu, Apr 11, 2024 at 7:44 PM Ilya Maximets <[email protected]> wrote:
>
> Inactivity probe interval on RAFT connections depend on a value of the
> election timer. However, the actual value is not known until the
> database snapshot with the RAFT information is received by a joining
> server. New joining server is using a default 1 second until then.
>
> In case a new joining server is trying to join an existing cluster
> with a large database, it may take more than a second to generate and
> send an initial database snapshot. This is causing an inability to
> actually join this cluster. Joining server sends ADD_SERVER request,
> waits 1 second, sends a probe, doesn't get a reply within another
> second, because the leader is busy preparing and sending an initial
> snapshot to it, disconnects, repeat.
>
> This is not an issue for the servers that did already join, since
> their probe intervals are larger than election timeout.
> Cooperative multitasking also doesn't fully solve this issue, since
> it depends on election timer, which is likely higher in the existing
> cluster with a very big database.
>
> Fix that by using the maximum election timer value for inactivity
> probes until the actual value is known. We still shouldn't completely
> disable the probes, because in the rare event the connection is
> established but the other side silently goes away, we still want to
> disconnect and try to re-establish the connection eventually.
>
> Since probe intervals also depend on the joining state now, update
> them when the server joins the cluster.
>
> Fixes: 14b2b0aad7ae ("raft: Reintroduce jsonrpc inactivity probes.")
> Reported-by: Terry Wilson <[email protected]>
> Reported-at: https://issues.redhat.com/browse/FDP-144
> Signed-off-by: Ilya Maximets <[email protected]>
> ---

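For anyone following the thread without the diff at hand, the selection
logic described in the commit message can be sketched roughly as below.
This is an illustration only, not the code from the patch: the struct,
the function name, and both constant values are assumptions made for the
example, and adding a fixed slack on top of the timer is likewise an
assumption (the commit message only says the interval is larger than the
election timeout).

```c
#include <stdbool.h>

/* Assumed values, for illustration only; the real limits live in the tree. */
#define SKETCH_ELECTION_MAX_MSEC   600000  /* Assumed maximum election timer. */
#define SKETCH_ELECTION_RANGE_MSEC 1000    /* Assumed slack above the timer. */

/* Hypothetical, reduced view of the server state that matters here. */
struct raft_sketch {
    bool joining;           /* True until the initial snapshot is received. */
    int election_timer_ms;  /* Default of 1000 until the real value is known. */
};

/* Returns the inactivity probe interval, in milliseconds.
 *
 * While joining, the configured election timer is not yet known, so the
 * maximum is used instead of the 1-second default.  Probes stay enabled so
 * that a silently dead peer is still detected eventually. */
static int
sketch_probe_interval_ms(const struct raft_sketch *raft)
{
    int base = raft->joining
               ? SKETCH_ELECTION_MAX_MSEC
               : raft->election_timer_ms;

    return base + SKETCH_ELECTION_RANGE_MSEC;
}
```

The interval has to be recomputed once 'joining' flips to false, which is
the "update them when the server joins the cluster" part of the message.
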
Acked-by: Mike Pattrick <[email protected]>
