On Thu, Mar 27, 2025 at 5:12 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> Consider the following scenario in a 3-node cluster:
>
>   1. node-1 goes down for some reason without leaving the cluster.
>   2. Some changes are committed to the log.
>   3. node-2 requests to leave the cluster with the cluster/leave command.
>   4. node-3 also requests to leave the cluster.
>   5. node-1 comes back online.
>
> In this scenario today the cluster breaks and doesn't recover.
> The reason is that both node-2 and node-3 are in the 'leaving' state,
> so they do not initiate elections.  node-1 is behind on log updates
> and can't become a leader.  But there is no leader to catch it up.
>
> In order for the cluster to recover in this situation, one of the
> leaving servers must become a leader again, then catch up node-1 with
> the log updates and then node-1 can become a new leader, process both
> server removal requests and become an operational cluster of 1.
>
> The 'leaving' state is not a real state of the server in RAFT and
> the server should still participate in the cluster until the removal
> is committed by the majority of the NEW configuration.  This can be
> achieved by allowing the server in a 'leaving' state to initiate
> elections and become a leader again this way.  Becoming a leader also
> means that this server will need to be able to execute commands and
> do all the normal tasks of a leader.
>
> Since the leader can't leave the cluster though, this server will need
> to attempt to transfer leadership again in order to actually leave.
> This should be done after a leave timeout.  The time between becoming
> a leader and transferring the leadership will allow node-1 to get up
> to speed with all the cluster updates and be ready to become a new
> leader.
>
> Sending a server removal request right after transferring leadership
> can't succeed, because the other server has to go through election
> in order to become a leader before it can process a removal request.
> So, add a delay between the transfer and the removal request.
> It can be lower than the election timer, just to avoid waiting for
> too long if the election timer has a large value.  But it should be
> randomized, so that multiple leaving servers do not just bounce the
> leadership between each other without ever getting to the server
> removal request.
>
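(Not part of the patch, just to make the timing above concrete: a
minimal, self-contained C sketch of the delay computation described in
the previous paragraph.  The function name is invented for illustration,
and plain rand() stands in for the random_range() helper that the real
code uses.)

    #include <stdlib.h>

    /* Delay before a leaving ex-leader retries its leave attempt: a tenth
     * of the election timer plus a random component of the same size, so
     * that two leaving servers do not retry in lockstep and keep bouncing
     * the leadership between each other. */
    static long long int
    leave_retry_delay(long long int election_timer_ms)
    {
        long long int base = election_timer_ms / 10;
        return base + (base > 0 ? rand() % base : 0);
    }
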
> A test is added to stress different scenarios with servers leaving
> while some of the cluster members are down.
>
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
> Reported-at: https://issues.redhat.com/browse/FDP-662
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> ---
>  ovsdb/raft.c           |  52 +++++++++++---
>  tests/ovsdb-cluster.at | 159 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 200 insertions(+), 11 deletions(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index 67eea279a..7d78f710e 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -1377,8 +1377,29 @@ raft_send_remove_server_requests(struct raft *raft)
>              raft_send(raft, &rpc);
>          }
>      }
> +}
> +
> +/* Sends requests required to leave the cluster and schedules the next time
> + * this function should be called. */
> +static void
> +raft_send_leave_requests(struct raft *raft)
> +{
> +    long long int delay = raft->election_timer;
>
> -    raft->leave_timeout = time_msec() + raft->election_timer;
> +    if (raft->role == RAFT_LEADER) {
> +        raft_transfer_leadership(raft, "this server is leaving the cluster");
> +        raft_become_follower(raft);
> +        /* Not sending the RemoveServerRequest right away, because a new
> +         * leader has to be elected first for the request to be successful.
> +         * But setting a shorter delay to avoid waiting for too long when
> +         * the leader re-election is fast.  Randomized to avoid two servers
> +         * bouncing the leadership between each other and never actually
> +         * leaving. */
> +        delay = delay / 10 + random_range(delay / 10);
> +    } else {
> +        raft_send_remove_server_requests(raft);
> +    }
> +    raft->leave_timeout = time_msec() + delay;
>  }
>
>  /* Attempts to start 'raft' leaving its cluster.  The caller can check progress
> @@ -1392,10 +1413,7 @@ raft_leave(struct raft *raft)
>      VLOG_INFO(SID_FMT": starting to leave cluster "CID_FMT,
>                SID_ARGS(&raft->sid), CID_ARGS(&raft->cid));
>      raft->leaving = true;
> -    raft_transfer_leadership(raft, "this server is leaving the cluster");
> -    raft_become_follower(raft);
> -    raft_send_remove_server_requests(raft);
> -    raft->leave_timeout = time_msec() + raft->election_timer;
> +    raft_send_leave_requests(raft);
>  }
>
>  /* Returns true if 'raft' is currently attempting to leave its cluster. */
> @@ -1867,10 +1885,6 @@ raft_start_election(struct raft *raft, bool is_prevote,
>      /* Leadership transfer doesn't use pre-vote. */
>      ovs_assert(!is_prevote || !leadership_transfer);
>
> -    if (raft->leaving) {
> -        return;
> -    }
> -
>      struct raft_server *me = raft_find_server(raft, &raft->sid);
>      if (!me) {
>          return;
> @@ -1997,6 +2011,12 @@ raft_conn_should_stay_open(struct raft *raft, struct raft_conn *conn)
>          return true;
>      }
>
> +    /* Keep the connection until we send a RemoveServerReply. */
> +    if (raft->remove_server
> +        && uuid_equals(&conn->sid, &raft->remove_server->sid)) {
> +        return true;
> +    }
> +
>      /* We have joined the cluster.  If we did that "recently", then there is a
>       * chance that we do not have the most recent server configuration log
>       * entry.  If so, it's a waste to disconnect from the servers that were in
> @@ -2123,6 +2143,8 @@ raft_run(struct raft *raft)
>                      count ++;
>                  }
>              }
> +            VLOG_DBG("%d out of %"PRIuSIZE" servers replied",
> +                      count, hmap_count(&raft->servers));
>              if (count >= hmap_count(&raft->servers) / 2) {
>                  HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
>                      server->replied = false;
> @@ -2139,7 +2161,7 @@ raft_run(struct raft *raft)
>      }
>
>      if (raft->leaving && time_msec() >= raft->leave_timeout) {
> -        raft_send_remove_server_requests(raft);
> +        raft_send_leave_requests(raft);
>      }
>
>      if (raft->joining && time_msec() >= raft->join_timeout) {
> @@ -2452,7 +2474,7 @@ raft_command_execute__(struct raft *raft, const struct json *data,
>                         const struct json *servers, uint64_t election_timer,
>                         const struct uuid *prereq, struct uuid *result)
>  {
> -    if (raft->joining || raft->leaving || raft->left || raft->failed) {
> +    if (raft->joining || raft->left || raft->failed) {
>          return raft_command_create_completed(RAFT_CMD_SHUTDOWN);
>      }
>
> @@ -4175,6 +4197,14 @@ raft_handle_remove_server_request(struct raft *raft,
>          return;
>      }
>
> +    /* Check for the server already being removed. */
> +    if (raft->remove_server
> +        && uuid_equals(&rq->sid, &raft->remove_server->sid)) {
> +        raft_send_remove_server_reply(raft, rq,
> +                                      false, RAFT_SERVER_IN_PROGRESS);
> +        return;
> +    }
> +
>      /* If the server isn't configured, report that. */
>      target = raft_find_server(raft, &rq->sid);
>      if (!target) {
> diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> index 9d8b4d06a..91a76cb81 100644
> --- a/tests/ovsdb-cluster.at
> +++ b/tests/ovsdb-cluster.at
> @@ -578,6 +578,165 @@ for i in $(seq $n); do
>      OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
>  done
>
> +AT_CLEANUP
> +
> +AT_BANNER([OVSDB - cluster failure while leaving])
> +AT_SETUP([OVSDB cluster - leaving the cluster with some servers down])
> +AT_KEYWORDS([ovsdb server negative unix cluster leave])
> +
> +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db \
> +            $top_srcdir/vswitchd/vswitch.ovsschema unix:s1.raft], [0], [], [stderr])
> +schema_name=$(ovsdb-tool schema-name $top_srcdir/vswitchd/vswitch.ovsschema)
> +for i in 2 3 4 5; do
> +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
> +done
> +
> +on_exit 'kill $(cat *.pid)'
> +on_exit "
> +  for i in \$(ls $(pwd)/s[[0-5]]); do
> +    ovs-appctl --timeout 1 -t \$i cluster/status $schema_name;
> +  done
> +"
> +dnl Starting all the servers.
> +for i in 1 2 3 4 5; do
> +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off \
> +                           --detach --no-chdir --log-file=s$i.log \
> +                           --pidfile=s$i.pid --unixctl=s$i \
> +                           --remote=punix:s$i.ovsdb s$i.db])
> +done
> +
> +dnl Make sure that all servers joined the cluster.
> +for i in 1 2 3 4 5; do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +
> +dnl Make sure the cluster is operational.
> +m4_define([DB_REMOTE], [unix:s1.ovsdb,unix:s2.ovsdb,unix:s3.ovsdb,unix:s4.ovsdb,unix:s5.ovsdb])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" --no-wait init])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" -vovsdb_cs:console:dbg --no-leader-only \
> +            --no-wait create QoS type=test-1], [0], [ignore], [ignore])
> +
> +dnl Stop servers 1 and 2.
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s1], [s1.pid])
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s2], [s2.pid])
> +
> +dnl Make sure that all remaining servers are functional as a cluster.
> +for i in 3 4 5; do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +
> +dnl Make sure the cluster is still operational.
> +m4_define([DB_REMOTE], [unix:s3.ovsdb,unix:s4.ovsdb,unix:s5.ovsdb])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" -vovsdb_cs:console:dbg --no-leader-only \
> +            --no-wait create QoS type=test-2], [0], [ignore], [ignore])
> +
> +dnl Servers 1 and 2 in a cluster of 5 are down, 3 servers are still alive.
> +dnl Server 3 can't leave, because the NEW configuration will be a cluster of
> +dnl 4 with 2 servers down and it doesn't have a quorum.  Try it.
> +dnl The cluster will fall apart until servers 1 or 2 come back to resolve
> +dnl the quorum issue, because servers 4 and 5 will no longer consider 3
> +dnl to be part of the configuration.
> +AT_CHECK([ovs-appctl -t $(pwd)/s3 cluster/leave $schema_name])
> +
> +dnl Check that the cluster is not operational.
> +for i in 3 4 5; do
> +    OVS_WAIT_UNTIL([ovs-appctl -t $(pwd)/s$i cluster/status $schema_name \
> +                        | grep -qE 'leaving|disconnected'])
> +done
> +
> +dnl Try to commit a transaction, it should not be successful.
> +m4_define([DB_REMOTE], [unix:s3.ovsdb,unix:s4.ovsdb,unix:s5.ovsdb])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" -vovsdb_cs:console:dbg --no-leader-only \
> +            --no-wait create QoS type=test-3], [1], [ignore], [stderr])
> +
> +dnl Now bring back server 2.  This should allow server 3 to leave.
> +AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off \
> +                          --detach --no-chdir --log-file=s2.log \
> +                          --pidfile=s2.pid --unixctl=s2 \
> +                          --remote=punix:s2.ovsdb s2.db])
> +
> +dnl Wait for server 3 to actually leave and stop the server.
> +AT_CHECK([ovsdb_client_wait unix:s3.ovsdb $schema_name removed])
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s3], [s3.pid])
> +
> +dnl Make sure that all remaining servers are functional as a cluster.
> +for i in 2 4 5; do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +dnl Make sure the cluster is operational again.
> +m4_define([DB_REMOTE], [unix:s2.ovsdb,unix:s4.ovsdb,unix:s5.ovsdb])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" -vovsdb_cs:console:dbg --no-leader-only \
> +            --no-wait create QoS type=test-4], [0], [ignore], [ignore])
> +
> +dnl Now we have a cluster of 4 servers (1, 2, 4, 5) with 1 server down.
> +dnl Server 2 should be able to leave, because the NEW configuration will
> +dnl be a cluster of 3 servers with 1 being down and it has a quorum.
> +AT_CHECK([ovs-appctl -t $(pwd)/s2 cluster/leave $schema_name])
> +dnl Wait for server 2 to actually leave and stop the server.
> +AT_CHECK([ovsdb_client_wait unix:s2.ovsdb $schema_name removed])
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s2], [s2.pid])
> +
> +dnl Make sure the cluster is still operational.
> +m4_define([DB_REMOTE], [unix:s4.ovsdb,unix:s5.ovsdb])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" -vovsdb_cs:console:dbg --no-leader-only \
> +            --no-wait create QoS type=test-5], [0], [ignore], [ignore])
> +
> +dnl Now we have a cluster of 3 servers (1, 4, 5) with 1 server down.
> +dnl None of the alive servers can leave, because the NEW configuration
> +dnl will be a cluster of 2 with 1 server down and it has no quorum.
> +dnl Request both to leave anyway.
> +for i in 4 5; do
> +    AT_CHECK([ovs-appctl -t $(pwd)/s$i cluster/leave $schema_name])
> +done
> +
> +dnl Check that the cluster is not operational.
> +for i in 4 5; do
> +    OVS_WAIT_UNTIL([ovs-appctl -t $(pwd)/s$i cluster/status $schema_name \
> +                        | grep -qE 'leaving|disconnected'])
> +done
> +
> +dnl Try to commit a transaction, it should not be successful.
> +m4_define([DB_REMOTE], [unix:s4.ovsdb,unix:s5.ovsdb])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" -vovsdb_cs:console:dbg --no-leader-only \
> +            --no-wait create QoS type=test-6], [1], [ignore], [stderr])
> +
> +dnl Now bring back the first server.
> +AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off \
> +                          --detach --no-chdir --log-file=s1.log \
> +                          --pidfile=s1.pid --unixctl=s1 \
> +                          --remote=punix:s1.ovsdb s1.db])
> +
> +dnl Now it should be possible for all the other servers to leave, so we
> +dnl should end up with a single-node cluster that consists of server 1.
> +for i in 4 5; do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name removed])
> +done
> +for i in 4 5; do
> +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
> +done
> +
> +dnl Wait for the first server to become a leader of a single-node cluster.
> +OVS_WAIT_UNTIL([ovs-appctl -t $(pwd)/s1 cluster/status $schema_name \
> +                    | grep -q 'Role: leader'])
> +AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/status $schema_name \
> +            | grep -c '    s[[1-5]] '], [0], [dnl
> +1
> +])
> +
> +dnl Check that the database is operational and the data is still in there.
> +m4_define([DB_REMOTE], [unix:s1.ovsdb])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" -vovsdb_cs:console:dbg --no-wait \
> +            create QoS type=test-7], [0], [ignore], [ignore])
> +AT_CHECK([ovs-vsctl --db="DB_REMOTE" --no-wait \
> +            --columns=type --bare list QoS | sed '/^$/d' | sort], [0], [dnl
> +test-1
> +test-2
> +test-4
> +test-5
> +test-7
> +])
> +
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s1], [s1.pid])
>  AT_CLEANUP
>
>
> --
> 2.47.0
>

Looks good. Thanks! I also tested with my own reproducer script, which,
in a cluster of 3, takes nb1 down and has nb2/nb3 try to leave.
Everything appeared to work as described in the patch: after nb1 comes
back up, nb2 and nb3 successfully leave and the cluster remains healthy.
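
For reference, a rough sketch of that kind of reproducer (not my exact
script; this variant uses ovsdb-server with the vswitch schema the same
way the new test does, rather than the OVN NB database, and waits
between the steps are omitted):

    #!/bin/sh
    # Build a 3-server cluster: s1 creates it, s2 and s3 join, all start.
    schema=vswitchd/vswitch.ovsschema
    name=$(ovsdb-tool schema-name $schema)
    ovsdb-tool create-cluster s1.db $schema unix:s1.raft
    for i in 2 3; do
        ovsdb-tool join-cluster s$i.db $name unix:s$i.raft unix:s1.raft
    done
    for i in 1 2 3; do
        ovsdb-server --detach --no-chdir --log-file=s$i.log \
                     --pidfile=s$i.pid --unixctl=s$i \
                     --remote=punix:s$i.ovsdb s$i.db
    done

    # Take s1 down without leaving, then ask both s2 and s3 to leave.
    kill $(cat s1.pid)
    ovs-appctl -t $(pwd)/s2 cluster/leave $name
    ovs-appctl -t $(pwd)/s3 cluster/leave $name

    # Bring s1 back.  Without the fix the cluster never recovers; with
    # it, s2 and s3 finish leaving and s1 ends up as a cluster of one.
    ovsdb-server --detach --no-chdir --log-file=s1.log \
                 --pidfile=s1.pid --unixctl=s1 \
                 --remote=punix:s1.ovsdb s1.db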

Acked-by: Terry Wilson <twil...@redhat.com>
Tested-by: Terry Wilson <twil...@redhat.com>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
