On 3/28/25 9:15 PM, Terry Wilson wrote:
> On Thu, Mar 27, 2025 at 5:12 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> Consider the following scenario in a 3-node cluster:
>>
>>   1. node-1 goes down for some reason without leaving the cluster.
>>   2. Some changes are committed to the log.
>>   3. node-2 requests to leave the cluster via the 'cluster/leave' command.
>>   4. node-3 also requests to leave the cluster.
>>   5. node-1 comes back online.
>>
>> In this scenario today, the cluster breaks and doesn't recover.
>> The reason is that both node-2 and node-3 are in the 'leaving' state,
>> so they do not initiate elections.  node-1 is behind on log updates
>> and can't become a leader.  But there is no leader to catch it up.
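>>
>> To see the deadlock concretely, below is a minimal standalone C
>> sketch.  It is illustrative only, not code from ovsdb/raft.c, and
>> all names in it are made up.  A server becomes a leader only if it
>> both starts an election and can win it; in the scenario above, no
>> server satisfies both conditions:
>>
>>     #include <stdbool.h>
>>     #include <stdio.h>
>>
>>     struct node {
>>         const char *name;
>>         bool leaving;      /* 'cluster/leave' was requested. */
>>         long long index;   /* Last index in the server's log. */
>>     };
>>
>>     /* Pre-patch behavior: a server in the 'leaving' state never
>>      * starts an election. */
>>     static bool
>>     starts_elections(const struct node *n)
>>     {
>>         return !n->leaving;
>>     }
>>
>>     /* RAFT voting rule: peers deny their votes to a candidate
>>      * whose log is behind their own. */
>>     static bool
>>     can_win(const struct node *n, long long latest)
>>     {
>>         return n->index >= latest;
>>     }
>>
>>     int
>>     main(void)
>>     {
>>         struct node cluster[] = {
>>             { "node-1", false, 5 },  /* Rejoined, log is behind. */
>>             { "node-2", true, 10 },  /* Leaving. */
>>             { "node-3", true, 10 },  /* Leaving. */
>>         };
>>
>>         for (int i = 0; i < 3; i++) {
>>             const struct node *n = &cluster[i];
>>             bool leader = starts_elections(n) && can_win(n, 10);
>>
>>             printf("%s can become a leader: %s\n",
>>                    n->name, leader ? "yes" : "no");
>>         }
>>         return 0;  /* Nobody prints "yes"; the cluster is stuck. */
>>     }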
>>
>> In order for the cluster to recover in this situation, one of the
>> leaving servers must become a leader again and catch node-1 up with
>> the log updates.  node-1 can then become the new leader, process
>> both server removal requests, and remain as an operational cluster
>> of one.
>>
>> The 'leaving' state is not a real state of the server in RAFT and
>> the server should still participate in the cluster until the removal
>> is committed by the majority of the NEW configuration.  This can be
>> achieved by allowing a server in the 'leaving' state to initiate
>> elections and thus become a leader again.  Becoming a leader also
>> means that this server will need to be able to execute commands and
>> do all the normal tasks of a leader.
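>>
>> In terms of the sketch above (again, illustrative only, not the
>> actual diff), the direction of the fix is to drop the 'leaving'
>> check from the election trigger, so an up-to-date leaving server
>> can campaign and win:
>>
>>     #include <stdbool.h>
>>     #include <stdio.h>
>>
>>     struct node {
>>         const char *name;
>>         bool leaving;
>>         long long index;
>>     };
>>
>>     static bool
>>     starts_elections(const struct node *n)
>>     {
>>         /* Previously: return !n->leaving;
>>          * Now a leaving server keeps campaigning until its
>>          * removal is actually committed. */
>>         (void) n;
>>         return true;
>>     }
>>
>>     static bool
>>     can_win(const struct node *n, long long latest)
>>     {
>>         return n->index >= latest;
>>     }
>>
>>     int
>>     main(void)
>>     {
>>         struct node node2 = { "node-2", true, 10 };
>>         bool leader = starts_elections(&node2)
>>                       && can_win(&node2, 10);
>>
>>         printf("node-2 can become a leader again: %s\n",
>>                leader ? "yes" : "no");
>>         return 0;
>>     }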
>>
>> Since a leader can't leave the cluster though, this server will need
>> to attempt to transfer leadership again in order to actually leave.
>> This should be done after a leave timeout.  The time between becoming
>> a leader and transferring leadership away allows node-1 to catch up
>> with all the cluster updates and be ready to become the new leader.
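>>
>> Below is a hypothetical run-loop fragment showing the shape of that
>> logic (a sketch with invented names, not the actual implementation):
>> a leaving server that became a leader again retries the transfer
>> once its leave timeout expires, instead of staying a leader forever:
>>
>>     #include <stdbool.h>
>>     #include <stdio.h>
>>     #include <time.h>
>>
>>     struct server {
>>         bool leaving;
>>         bool is_leader;
>>         long long leave_deadline_ms;  /* When to retry leaving. */
>>     };
>>
>>     static long long
>>     now_ms(void)
>>     {
>>         struct timespec ts;
>>
>>         clock_gettime(CLOCK_MONOTONIC, &ts);
>>         return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
>>     }
>>
>>     static void
>>     transfer_leadership(struct server *s)
>>     {
>>         /* Hand the leader role to the most up-to-date follower. */
>>         s->is_leader = false;
>>         printf("leadership transferred\n");
>>     }
>>
>>     /* Called from the main loop.  The window between becoming a
>>      * leader and the deadline is what lets a stale server like
>>      * node-1 catch up with the log. */
>>     static void
>>     run_leave(struct server *s, long long leave_timeout_ms)
>>     {
>>         if (s->leaving && s->is_leader
>>             && now_ms() >= s->leave_deadline_ms) {
>>             transfer_leadership(s);
>>             s->leave_deadline_ms = now_ms() + leave_timeout_ms;
>>         }
>>     }
>>
>>     int
>>     main(void)
>>     {
>>         struct server s = { true, true, 0 };
>>
>>         run_leave(&s, 5000);
>>         return 0;
>>     }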
>>
>> Sending a server removal request right after transferring leadership
>> can't succeed, because the other server has to go through an election
>> and become a leader before it can process a removal request.  So, a
>> delay is added between the transfer and the removal request.  It can
>> be shorter than the election timer, just to avoid waiting for too
>> long if the election timer has a large value.  But it should be
>> randomized, so that multiple leaving servers do not just bounce the
>> leadership around without ever getting to the server removal request.
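>>
>> One plausible way to compute such a delay (the exact constants in
>> the patch may differ; this is only a sketch): pick a value between
>> one half and three quarters of the election timer, so it expires
>> before a normal election timeout, but at a different moment on
>> every leaving server:
>>
>>     #include <stdio.h>
>>     #include <stdlib.h>
>>     #include <time.h>
>>
>>     static long long
>>     removal_delay_ms(long long election_timer_ms)
>>     {
>>         long long base = election_timer_ms / 2;
>>         long long jitter = rand() % (election_timer_ms / 4 + 1);
>>
>>         return base + jitter;
>>     }
>>
>>     int
>>     main(void)
>>     {
>>         srand((unsigned) time(NULL));
>>         for (int i = 0; i < 3; i++) {
>>             printf("delay: %lld ms (election timer: 1000 ms)\n",
>>                    removal_delay_ms(1000));
>>         }
>>         return 0;
>>     }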
>>
>> A test is added to stress different scenarios with servers leaving
>> while some of the cluster members are down.
>>
>> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
>> Reported-at: https://issues.redhat.com/browse/FDP-662
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>> ---
>>  ovsdb/raft.c           |  52 +++++++++++---
>>  tests/ovsdb-cluster.at | 159 +++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 200 insertions(+), 11 deletions(-)
> 
> Looks good. Thanks! I also tested with my own reproducer script that,
> in a 3-node cluster, takes nb1 down while nb2/nb3 try to leave. Everything
> appeared to work as described in the patch. After nb1 comes back up,
> nb2 and nb3 successfully leave and the cluster remains healthy.
> 
> Acked-by: Terry Wilson <twil...@redhat.com>
> Tested-by: Terry Wilson <twil...@redhat.com>

Thanks, Terry and Mike!

Applied and backported down to 3.3.  It doesn't apply cleanly to 3.2,
so I didn't backport it there.  That can be done later, if necessary.

Best regards, Ilya Maximets.