On 3/28/25 9:15 PM, Terry Wilson wrote:
> On Thu, Mar 27, 2025 at 5:12 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> Consider the following scenario in a 3-node cluster:
>>
>>  1. node-1 goes down for some reason without leaving the cluster.
>>  2. Some changes are committed to the log.
>>  3. node-2 requests to leave the cluster with the cluster/leave command.
>>  4. node-3 also requests to leave the cluster.
>>  5. node-1 comes back online.
>>
>> In this scenario today, the cluster breaks and doesn't recover.
>> The reason is that both node-2 and node-3 are in the 'leaving' state,
>> so they do not initiate elections.  node-1 is behind on log updates
>> and can't become a leader.  But there is no leader to catch it up.
>>
>> In order for the cluster to recover in this situation, one of the
>> leaving servers must become a leader again, then catch up node-1 with
>> the log updates, and then node-1 can become a new leader, process both
>> server removal requests and become an operational cluster of 1.
>>
>> The 'leaving' state is not a real state of the server in RAFT, and
>> the server should still participate in the cluster until the removal
>> is committed by the majority of the NEW configuration.  This can be
>> achieved by allowing a server in the 'leaving' state to initiate
>> elections and become a leader again this way.  Becoming a leader also
>> means that this server will need to be able to execute commands and
>> do all the normal tasks of a leader.
>>
>> Since the leader can't leave the cluster though, this server will need
>> to attempt to transfer leadership again in order to actually leave.
>> This should be done after a leave timeout.  The time between becoming
>> a leader and transferring the leadership will allow node-1 to get up
>> to speed with all the cluster updates and be ready to become a new
>> leader.
>>
>> Sending a server removal request right after transferring leadership
>> can't succeed, because the other server has to go through an election
>> in order to become a leader before it can process a removal request.
>> So, a delay is added between the transfer and the removal request.
>> It can be lower than the election timer, just to avoid waiting for
>> too long if the election timer has a large value.  But it should be
>> randomized, so multiple leaving servers do not just bounce the
>> leadership around without actually getting to the server removal
>> request.
>>
>> A test is added to stress different scenarios with servers leaving
>> while some of the cluster members are down.
>>
>> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered
>> databases.")
>> Reported-at: https://issues.redhat.com/browse/FDP-662
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>> ---
>>  ovsdb/raft.c           |  52 +++++++++++---
>>  tests/ovsdb-cluster.at | 159 +++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 200 insertions(+), 11 deletions(-)
>
> Looks good.  Thanks!  I also tested with my own reproducer script that,
> in a cluster of 3, has nb1 down and nb2/nb3 trying to leave.  Everything
> appeared to work as described in the patch.  After nb1 comes back up,
> nb2 and nb3 successfully leave and the cluster remains healthy.
>
> Acked-by: Terry Wilson <twil...@redhat.com>
> Tested-by: Terry Wilson <twil...@redhat.com>
Thanks, Terry and Mike!  Applied and backported down to 3.3.  It doesn't
apply cleanly to 3.2, so I didn't backport it there.  That can be done
later, if necessary.

Best regards, Ilya Maximets.
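
[Editor's note: for readers following along, below is a minimal,
self-contained C sketch of the recovery flow the commit message
describes.  Every name in it (may_start_election(),
transfer_leadership(), schedule_remove_request(), the timer values)
is hypothetical and does not reflect the actual ovsdb/raft.c code;
see the patch itself for the real implementation.]

/* Illustrative sketch only -- all identifiers here are made up. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define ELECTION_TIMER_MS 1000   /* Assumed election timer. */
#define LEAVE_TIMEOUT_MS  1000   /* Assumed leave timeout. */

struct server {
    bool leaving;             /* cluster/leave requested, but removal not
                               * yet committed by the new configuration. */
    bool is_leader;
    long long leave_start_ms; /* When the leave was requested. */
};

/* A server in the 'leaving' state must still be allowed to start
 * elections; otherwise a cluster where every live member is leaving
 * has nobody to elect, and the removals can never be committed. */
static bool
may_start_election(const struct server *s)
{
    (void) s;
    return true;   /* Before the fix this was effectively "!s->leaving". */
}

/* Delay between transferring leadership and re-sending the removal
 * request.  It may be shorter than the election timer, but must be
 * randomized so several leaving servers don't just bounce leadership
 * between themselves without ever reaching the removal request. */
static long long
removal_retry_delay_ms(void)
{
    return ELECTION_TIMER_MS / 2 + rand() % (ELECTION_TIMER_MS / 2);
}

static void
transfer_leadership(struct server *s)
{
    s->is_leader = false;
    printf("leadership transferred\n");
}

static void
schedule_remove_request(struct server *s, long long when_ms)
{
    (void) s;
    printf("removal request scheduled at t=%lld ms\n", when_ms);
}

/* Periodic work for a leaving server that became leader again: give the
 * lagging follower LEAVE_TIMEOUT_MS to catch up on the log, then hand
 * leadership off and retry the removal after a randomized delay. */
static void
leaving_leader_run(struct server *s, long long now_ms)
{
    if (s->is_leader && s->leaving
        && now_ms - s->leave_start_ms > LEAVE_TIMEOUT_MS) {
        transfer_leadership(s);
        schedule_remove_request(s, now_ms + removal_retry_delay_ms());
    }
}

int
main(void)
{
    struct server s = { .leaving = true, .is_leader = true,
                        .leave_start_ms = 0 };

    /* The server is leaving but may still participate in elections. */
    if (may_start_election(&s)) {
        leaving_leader_run(&s, 2000);   /* 2000 ms > LEAVE_TIMEOUT_MS. */
    }
    return 0;
}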