On Tue, May 24, 2022 at 1:18 PM Ilya Maximets <[email protected]> wrote:
>
> While becoming a follower, the leader aborts all the current
> 'in-flight' commands, so the higher layers can re-try the corresponding
> transactions when the new leader is elected.  However, most of these
> commands have already been sent to followers as append requests, so
> they will actually be committed by the majority of the cluster members,
> i.e. will be treated as committed by the new leader, unless there is
> an actual network problem between servers.  The old leader, though,
> will decline the append replies, since it is not the leader anymore
> and the commands were already completed with RAFT_CMD_LOST_LEADERSHIP
> status.
>
> The new leader will replicate the commit index back to the old leader,
> and the old leader will re-try the previously "failed" transaction,
> because "cluster error"s are temporary.
>
> If the transaction had prerequisites that do not allow committing it
> twice, or there are other database constraints (like indexes) that
> prevent a double commit, the server will reply to the client with a
> false-negative transaction result.
>
> If there are no prerequisites or additional database constraints,
> the server will execute the same transaction again, but as a follower.
>
> E.g., in the OVN case this may result in the creation of duplicated
> logical switches / routers / load balancers, i.e. resources with the
> same non-indexed name.  That may cause issues later, where ovn-nbctl
> will not be able to add ports to these switches / routers.
>
> The suggested solution is to not complete (abort) the commands, but to
> allow them to be completed by the commit index update from the new
> leader.  This is similar to what we do to complete commands in the
> reverse scenario, when a follower becomes the leader.  That scenario
> was fixed by commit 5a9b53a51ec9 ("ovsdb raft: Fix duplicated
> transaction execution when leader failover.").
>
> The code paths for leader and follower inside raft_update_commit_index()
> were very similar, so they were refactored into one, since we also
> need the ability to complete more than one command on a follower.
>
> A failure test was added to exercise the leadership-transfer scenario.
>
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
> Fixes: 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.")
> Reported-at: https://bugzilla.redhat.com/2046340
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>
> Version 2:
>   - Updated the commit message, clarifying the word 'completed'. [Han]
>   - Dropped the log reset part, since it's not really necessary
>     and a bit misleading. [Han]

Thanks Ilya!
Acked-by: Han Zhou <[email protected]>
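To make the failure mode concrete for other readers, here is a tiny
standalone model of the scenario described in the commit message.  The
names and the toy "database" are illustrative only, not actual ovsdb
code:

    #include <stdio.h>

    /* Toy model of the double commit.  The client treats
     * RAFT_CMD_LOST_LEADERSHIP as a temporary "cluster error" and
     * retries, but the first attempt already reached the majority and
     * is committed by the new leader anyway. */

    static int rows;  /* Copies of the same logical switch in the DB. */

    enum status { SUCCESS, LOST_LEADERSHIP };

    static enum status
    first_attempt(void)
    {
        rows++;                  /* Replicated to the majority, so the
                                  * new leader commits the entry... */
        return LOST_LEADERSHIP;  /* ...but the old leader reports an
                                  * error while becoming a follower. */
    }

    static enum status
    retry_attempt(void)
    {
        rows++;                  /* No prerequisites: the transaction is
                                  * executed a second time. */
        return SUCCESS;
    }

    int
    main(void)
    {
        if (first_attempt() == LOST_LEADERSHIP) {
            retry_attempt();  /* "cluster error"s are temporary: retry. */
        }
        printf("copies of the switch: %d\n", rows);  /* 2, not 1. */
        return 0;
    }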
>  ovsdb/raft.c           | 133 ++++++++++++++++++++++++-----------------
>  tests/ovsdb-cluster.at |  27 +++++++--
>  2 files changed, 99 insertions(+), 61 deletions(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index 530c5e5a3..f9fa6443b 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -78,6 +78,8 @@ enum raft_failure_test {
>      FT_DELAY_ELECTION,
>      FT_DONT_SEND_VOTE_REQUEST,
>      FT_STOP_RAFT_RPC,
> +    FT_TRANSFER_LEADERSHIP,
> +    FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ,
>  };
>  static enum raft_failure_test failure_test;
>
> @@ -1931,6 +1933,13 @@ raft_run(struct raft *raft)
>          return;
>      }
>
> +    if (failure_test == FT_TRANSFER_LEADERSHIP) {
> +        /* Using this function as it conveniently implements all we need and
> +         * snapshotting is the main test scenario for leadership transfer. */
> +        raft_notify_snapshot_recommended(raft);
> +        failure_test = FT_NO_TEST;
> +    }
> +
>      raft_waiters_run(raft);
>
>      if (!raft->listener && time_msec() >= raft->listen_backoff) {
>
> @@ -2066,7 +2075,14 @@ raft_run(struct raft *raft)
>      HMAP_FOR_EACH_SAFE (cmd, hmap_node, &raft->commands) {
>          if (cmd->timestamp
>              && now - cmd->timestamp > raft->election_timer * 2) {
> -            raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
> +            if (cmd->index && raft->role != RAFT_LEADER) {
> +                /* This server lost leadership and command didn't complete
> +                 * in time.  Likely, it wasn't replicated to the majority
> +                 * of servers before losing the leadership. */
> +                raft_command_complete(raft, cmd, RAFT_CMD_LOST_LEADERSHIP);
> +            } else {
> +                raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
> +            }
>          }
>      }
>      raft_reset_ping_timer(raft);
>
> @@ -2258,6 +2274,9 @@ raft_command_initiate(struct raft *raft,
>      if (failure_test == FT_CRASH_AFTER_SEND_APPEND_REQ) {
>          ovs_fatal(0, "Raft test: crash after sending append_request.");
>      }
> +    if (failure_test == FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ) {
> +        failure_test = FT_TRANSFER_LEADERSHIP;
> +    }
>      raft_reset_ping_timer(raft);
>
>      return cmd;
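A note on how the two new failure modes chain together:
"transfer-leadership-after-sending-append-request" only re-arms the
plain "transfer-leadership" mode once an append request has actually
gone out (raft_command_initiate() above), and the next raft_run()
iteration then performs the transfer and disarms the test.  A compilable
sketch of that little state machine, with simplified names of my own:

    #include <stdio.h>

    /* Simplified model of the failure-test chaining; the real logic
     * lives in raft_command_initiate() and raft_run(). */
    enum ft_mode { NO_TEST, TRANSFER, TRANSFER_AFTER_APPEND_REQ };

    static enum ft_mode
    after_append_request_sent(enum ft_mode m)
    {
        /* Arms the plain transfer once the append request went out. */
        return m == TRANSFER_AFTER_APPEND_REQ ? TRANSFER : m;
    }

    static enum ft_mode
    next_raft_run(enum ft_mode m)
    {
        if (m == TRANSFER) {
            /* Stands in for raft_notify_snapshot_recommended(), which
             * transfers leadership before snapshotting. */
            printf("transferring leadership\n");
            return NO_TEST;  /* One-shot: the test disarms itself. */
        }
        return m;
    }

    int
    main(void)
    {
        enum ft_mode m = TRANSFER_AFTER_APPEND_REQ;

        m = next_raft_run(m);             /* Not armed yet: no-op.     */
        m = after_append_request_sent(m); /* Command sent, test armed. */
        m = next_raft_run(m);             /* Leadership transferred.   */
        return 0;
    }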
> @@ -2624,7 +2643,13 @@ raft_become_follower(struct raft *raft)
>       * configuration is already part of the log.  Possibly the configuration
>       * log entry will not be committed, but until we know that we must use the
>       * new configuration.  Our AppendEntries processing will properly update
> -     * the server configuration later, if necessary. */
> +     * the server configuration later, if necessary.
> +     *
> +     * Also we do not complete commands here, as they can still be completed
> +     * if their log entries have already been replicated to other servers.
> +     * If the entries were actually committed according to the new leader, our
> +     * AppendEntries processing will complete the corresponding commands.
> +     */
>      struct raft_server *s;
>      HMAP_FOR_EACH (s, hmap_node, &raft->add_servers) {
>          raft_send_add_server_reply__(raft, &s->sid, s->address, false,
> @@ -2638,8 +2663,6 @@ raft_become_follower(struct raft *raft)
>          raft_server_destroy(raft->remove_server);
>          raft->remove_server = NULL;
>      }
> -
> -    raft_complete_all_commands(raft, RAFT_CMD_LOST_LEADERSHIP);
>  }
>
>  static void
> @@ -2888,61 +2911,56 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
>          return false;
>      }
>
> -    if (raft->role == RAFT_LEADER) {
> -        while (raft->commit_index < new_commit_index) {
> -            uint64_t index = ++raft->commit_index;
> -            const struct raft_entry *e = raft_get_entry(raft, index);
> -            if (raft_entry_has_data(e)) {
> -                struct raft_command *cmd
> -                    = raft_find_command_by_eid(raft, &e->eid);
> -                if (cmd) {
> -                    if (!cmd->index) {
> -                        VLOG_DBG("Command completed after role change from"
> -                                 " follower to leader "UUID_FMT,
> -                                 UUID_ARGS(&e->eid));
> -                        cmd->index = index;
> -                    }
> -                    raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
> +    while (raft->commit_index < new_commit_index) {
> +        uint64_t index = ++raft->commit_index;
> +        const struct raft_entry *e = raft_get_entry(raft, index);
> +
> +        if (raft_entry_has_data(e)) {
> +            struct raft_command *cmd = raft_find_command_by_eid(raft, &e->eid);
> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
> +
> +            if (cmd) {
> +                if (!cmd->index && raft->role == RAFT_LEADER) {
> +                    VLOG_INFO_RL(&rl,
> +                        "command completed after role change from "
> +                        "follower to leader (eid: "UUID_FMT", "
> +                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
> +                } else if (!cmd->index && raft->role != RAFT_LEADER) {
> +                    /* This can happen when leader fail-over before sending
> +                     * execute_command_reply. */
> +                    VLOG_INFO_RL(&rl,
> +                        "command completed without reply (eid: "UUID_FMT", "
> +                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
> +                } else if (cmd->index && raft->role != RAFT_LEADER) {
> +                    /* This can happen if current server lost leadership after
> +                     * sending append requests to the majority of servers, but
> +                     * before receiving majority of append replies. */
> +                    VLOG_INFO_RL(&rl,
> +                        "command completed after role change from "
> +                        "leader to follower (eid: "UUID_FMT", "
> +                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
> +                    /* Clearing 'sid' to avoid sending cmd execution reply. */
> +                    cmd->sid = UUID_ZERO;
> +                } else {
> +                    /* (cmd->index && raft->role == RAFT_LEADER)
> +                     * Normal command completion on a leader. */
>                  }
> -            }
> -            if (e->election_timer) {
> -                VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
> -                          raft->election_timer, e->election_timer);
> -                raft->election_timer = e->election_timer;
> -                raft->election_timer_new = 0;
> -                raft_update_probe_intervals(raft);
> -            }
> -            if (e->servers) {
> -                /* raft_run_reconfigure() can write a new Raft entry, which can
> -                 * reallocate raft->entries, which would invalidate 'e', so
> -                 * this case must be last, after the one for 'e->data'. */
> -                raft_run_reconfigure(raft);
> +                cmd->index = index;
> +                raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
>              }
>          }
> -    } else {
> -        while (raft->commit_index < new_commit_index) {
> -            uint64_t index = ++raft->commit_index;
> -            const struct raft_entry *e = raft_get_entry(raft, index);
> -            if (e->election_timer) {
> -                VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
> -                          raft->election_timer, e->election_timer);
> -                raft->election_timer = e->election_timer;
> -                raft_update_probe_intervals(raft);
> -            }
> +        if (e->election_timer) {
> +            VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
> +                      raft->election_timer, e->election_timer);
> +            raft->election_timer = e->election_timer;
> +            raft->election_timer_new = 0;
> +            raft_update_probe_intervals(raft);
>          }
> -        /* Check if any pending command can be completed, and complete it.
> -         * This can happen when leader fail-over before sending
> -         * execute_command_reply. */
> -        const struct uuid *eid = raft_get_eid(raft, new_commit_index);
> -        struct raft_command *cmd = raft_find_command_by_eid(raft, eid);
> -        if (cmd) {
> -            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
> -            VLOG_INFO_RL(&rl,
> -                         "Command completed without reply (eid: "UUID_FMT", "
> -                         "commit index: %"PRIu64")",
> -                         UUID_ARGS(eid), new_commit_index);
> -            cmd->index = new_commit_index;
> -            raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
> +        if (e->servers && raft->role == RAFT_LEADER) {
> +            /* raft_run_reconfigure() can write a new Raft entry, which can
> +             * reallocate raft->entries, which would invalidate 'e', so
> +             * this case must be last, after the one for 'e->data'. */
> +            raft_run_reconfigure(raft);
>          }
>      }
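To summarize the four cases that the unified loop above distinguishes,
here is a compact, compilable decision table.  'has_index' and
'is_leader' are simplified stand-ins for 'cmd->index != 0' and
'raft->role == RAFT_LEADER'; the real code additionally clears 'sid' in
the leader-to-follower case to suppress the execute_command_reply:

    #include <stdbool.h>
    #include <stdio.h>

    /* Decision table mirroring the unified command-completion loop. */
    static const char *
    completion_case(bool has_index, bool is_leader)
    {
        if (!has_index && is_leader) {
            return "completed after role change from follower to leader";
        } else if (!has_index && !is_leader) {
            return "completed without reply (leader failed over before "
                   "sending execute_command_reply)";
        } else if (has_index && !is_leader) {
            return "completed after role change from leader to follower "
                   "(execute_command_reply suppressed)";
        } else {
            return "normal command completion on a leader";
        }
    }

    int
    main(void)
    {
        for (int i = 0; i < 4; i++) {
            bool has_index = (i & 1) != 0;
            bool is_leader = (i & 2) != 0;

            printf("index=%d leader=%d: %s\n", (int) has_index,
                   (int) is_leader, completion_case(has_index, is_leader));
        }
        return 0;
    }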
> @@ -4975,6 +4993,11 @@ raft_unixctl_failure_test(struct unixctl_conn *conn OVS_UNUSED,
>          failure_test = FT_DONT_SEND_VOTE_REQUEST;
>      } else if (!strcmp(test, "stop-raft-rpc")) {
>          failure_test = FT_STOP_RAFT_RPC;
> +    } else if (!strcmp(test,
> +                       "transfer-leadership-after-sending-append-request")) {
> +        failure_test = FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ;
> +    } else if (!strcmp(test, "transfer-leadership")) {
> +        failure_test = FT_TRANSFER_LEADERSHIP;
>      } else if (!strcmp(test, "clear")) {
>          failure_test = FT_NO_TEST;
>          unixctl_command_reply(conn, "test dismissed");
> diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> index ee9c7b937..0f7076a05 100644
> --- a/tests/ovsdb-cluster.at
> +++ b/tests/ovsdb-cluster.at
> @@ -468,6 +468,7 @@ ovsdb_cluster_failure_test () {
>      if test "$crash_node" == "1"; then
>          new_leader=$5
>      fi
> +    log_grep=$6
>
>      cp $top_srcdir/vswitchd/vswitch.ovsschema schema
>      schema=`ovsdb-tool schema-name schema`
> @@ -488,7 +489,7 @@ ovsdb_cluster_failure_test () {
>      start_server() {
>          local i=$1
>          printf "\ns$i: starting\n"
> -        AT_CHECK([ovsdb-server -vjsonrpc -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
> +        AT_CHECK([ovsdb-server -vjsonrpc -vraft -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
>      }
>      connect_server() {
>          local i=$1
> @@ -514,14 +515,23 @@ ovsdb_cluster_failure_test () {
>          fi
>          AT_CHECK([ovs-appctl -t "`pwd`"/s$delay_election_node cluster/failure-test delay-election], [0], [ignore])
>      fi
> +
> +    # Initializing the database separately to avoid extra 'wait' operation
> +    # in later transactions.
> +    AT_CHECK([ovs-vsctl -v --db="$db" --no-leader-only --no-shuffle-remotes --no-wait init], [0], [ignore], [ignore])
> +
>      AT_CHECK([ovs-appctl -t "`pwd`"/s$crash_node cluster/failure-test $crash_command], [0], [ignore])
>      AT_CHECK([ovs-vsctl -v --db="$db" --no-leader-only --no-shuffle-remotes --no-wait create QoS type=x], [0], [ignore], [ignore])
>
> -    # Make sure that the node really crashed.
> -    AT_CHECK([ls s$crash_node.ovsdb], [2], [ignore], [ignore])
> -    # XXX: Client will fail if remotes contains unix socket that doesn't exist (killed).
> -    if test "$remote_1" = "$crash_node"; then
> -        db=unix:s$remote_2.ovsdb
> +    # Make sure that the node really crashed or has specific log message.
> +    if test -z "$log_grep"; then
> +        AT_CHECK([ls s$crash_node.ovsdb], [2], [ignore], [ignore])
> +        # XXX: Client will fail if remotes contains unix socket that doesn't exist (killed).
> +        if test "$remote_1" = "$crash_node"; then
> +            db=unix:s$remote_2.ovsdb
> +        fi
> +    else
> +        OVS_WAIT_UNTIL([grep -q "$log_grep" s${crash_node}.log])
>      fi
>      AT_CHECK([ovs-vsctl --db="$db" --no-leader-only --no-wait --columns=type --bare list QoS], [0], [x
> ])
> @@ -617,6 +627,11 @@ AT_KEYWORDS([ovsdb server negative unix cluster pending-txn])
>  ovsdb_cluster_failure_test 2 2 3 crash-after-receiving-append-request-update
>  AT_CLEANUP
>
> +AT_SETUP([OVSDB cluster - txn on leader, leader transfers leadership after sending appendReq])
> +AT_KEYWORDS([ovsdb server negative unix cluster pending-txn transfer])
> +ovsdb_cluster_failure_test 1 2 1 transfer-leadership-after-sending-append-request -1 "Transferring leadership"
> +AT_CLEANUP
> +
>
>  AT_SETUP([OVSDB cluster - competing candidates])
>  AT_KEYWORDS([ovsdb server negative unix cluster competing-candidates])
> --
> 2.34.3
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
