On Tue, May 24, 2022 at 1:18 PM Ilya Maximets <[email protected]> wrote:
>
> While becoming a follower, the leader aborts all the current
> 'in-flight' commands, so the higher layers can re-try the corresponding
> transactions when the new leader is elected.  However, most of these
> commands have already been sent to followers as append requests, so
> they will actually be committed by the majority of the cluster members,
> i.e. will be treated as committed by the new leader, unless there is
> an actual network problem between servers.  The old leader, though,
> will decline the append replies, since it is not the leader anymore
> and the commands were already completed with RAFT_CMD_LOST_LEADERSHIP
> status.
>
> The new leader will replicate the commit index back to the old leader,
> and the old leader will re-try the previously "failed" transaction,
> because "cluster error"s are temporary.
>
> If the transaction had prerequisites that do not allow committing it
> twice, or there are other database constraints (like indexes) that
> prevent a double commit, the server will reply to the client with a
> false-negative transaction result.
>
> If there are no prerequisites or additional database constraints,
> the server will execute the same transaction again, but as a follower.
>
> E.g., in the OVN case this may result in the creation of duplicated
> logical switches / routers / load balancers, i.e. resources with the
> same non-indexed name.  That may cause issues later, where ovn-nbctl
> will not be able to add ports to these switches / routers.
>
> The suggested solution is to not complete (abort) the commands, but to
> allow them to be completed by the commit index update from the new
> leader.  This is similar to what we do to complete commands in the
> reverse scenario, when a follower becomes the leader.  That scenario
> was fixed by commit 5a9b53a51ec9 ("ovsdb raft: Fix duplicated
> transaction execution when leader failover.").
>
> The code paths for leader and follower inside raft_update_commit_index()
> were very similar, so they were refactored into one, since we also
> need the ability to complete more than one command on a follower.
>
> A failure test was added to exercise the leadership-transfer scenario.
>
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
> Fixes: 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.")
> Reported-at: https://bugzilla.redhat.com/2046340
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>
> Version 2:
>   - Updated the commit message, clarifying the word 'completed'. [Han]
>   - Dropped the log reset part, since it's not really necessary
>     and a bit misleading. [Han]

Thanks Ilya!
Acked-by: Han Zhou <[email protected]>
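To make the failure mode concrete for other readers, here is a tiny
standalone model of the scenario described in the commit message.  The
names and the toy "database" are illustrative only, not actual ovsdb
code:

    #include <stdio.h>

    /* Toy model of the double commit.  The client treats
     * RAFT_CMD_LOST_LEADERSHIP as a temporary "cluster error" and
     * retries, but the first attempt already reached the majority and
     * is committed by the new leader anyway. */

    static int rows;  /* Copies of the same logical switch in the DB. */

    enum status { SUCCESS, LOST_LEADERSHIP };

    static enum status
    first_attempt(void)
    {
        rows++;                  /* Replicated to the majority, so the
                                  * new leader commits the entry... */
        return LOST_LEADERSHIP;  /* ...but the old leader reports an
                                  * error while becoming a follower. */
    }

    static enum status
    retry_attempt(void)
    {
        rows++;                  /* No prerequisites: the transaction is
                                  * executed a second time. */
        return SUCCESS;
    }

    int
    main(void)
    {
        if (first_attempt() == LOST_LEADERSHIP) {
            retry_attempt();  /* "cluster error"s are temporary: retry. */
        }
        printf("copies of the switch: %d\n", rows);  /* 2, not 1. */
        return 0;
    }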
>  ovsdb/raft.c           | 133 ++++++++++++++++++++++++-----------------
>  tests/ovsdb-cluster.at |  27 +++++++--
>  2 files changed, 99 insertions(+), 61 deletions(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index 530c5e5a3..f9fa6443b 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -78,6 +78,8 @@ enum raft_failure_test {
>      FT_DELAY_ELECTION,
>      FT_DONT_SEND_VOTE_REQUEST,
>      FT_STOP_RAFT_RPC,
> +    FT_TRANSFER_LEADERSHIP,
> +    FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ,
>  };
>  static enum raft_failure_test failure_test;
>
> @@ -1931,6 +1933,13 @@ raft_run(struct raft *raft)
>          return;
>      }
>
> +    if (failure_test == FT_TRANSFER_LEADERSHIP) {
> +        /* Using this function as it conveniently implements all we need and
> +         * snapshotting is the main test scenario for leadership transfer. */
> +        raft_notify_snapshot_recommended(raft);
> +        failure_test = FT_NO_TEST;
> +    }
> +
>      raft_waiters_run(raft);
>
>      if (!raft->listener && time_msec() >= raft->listen_backoff) {
>
> @@ -2066,7 +2075,14 @@ raft_run(struct raft *raft)
>      HMAP_FOR_EACH_SAFE (cmd, hmap_node, &raft->commands) {
>          if (cmd->timestamp
>              && now - cmd->timestamp > raft->election_timer * 2) {
> -            raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
> +            if (cmd->index && raft->role != RAFT_LEADER) {
> +                /* This server lost leadership and command didn't complete
> +                 * in time.  Likely, it wasn't replicated to the majority
> +                 * of servers before losing the leadership. */
> +                raft_command_complete(raft, cmd, RAFT_CMD_LOST_LEADERSHIP);
> +            } else {
> +                raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
> +            }
>          }
>      }
>      raft_reset_ping_timer(raft);
>
> @@ -2258,6 +2274,9 @@ raft_command_initiate(struct raft *raft,
>      if (failure_test == FT_CRASH_AFTER_SEND_APPEND_REQ) {
>          ovs_fatal(0, "Raft test: crash after sending append_request.");
>      }
> +    if (failure_test == FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ) {
> +        failure_test = FT_TRANSFER_LEADERSHIP;
> +    }
>      raft_reset_ping_timer(raft);
>
>      return cmd;
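A note on how the two new failure modes chain together:
"transfer-leadership-after-sending-append-request" only re-arms the
plain "transfer-leadership" mode once an append request has actually
gone out (raft_command_initiate() above), and the next raft_run()
iteration then performs the transfer and disarms the test.  A compilable
sketch of that little state machine, with simplified names of my own:

    #include <stdio.h>

    /* Simplified model of the failure-test chaining; the real logic
     * lives in raft_command_initiate() and raft_run(). */
    enum ft_mode { NO_TEST, TRANSFER, TRANSFER_AFTER_APPEND_REQ };

    static enum ft_mode
    after_append_request_sent(enum ft_mode m)
    {
        /* Arms the plain transfer once the append request went out. */
        return m == TRANSFER_AFTER_APPEND_REQ ? TRANSFER : m;
    }

    static enum ft_mode
    next_raft_run(enum ft_mode m)
    {
        if (m == TRANSFER) {
            /* Stands in for raft_notify_snapshot_recommended(), which
             * transfers leadership before snapshotting. */
            printf("transferring leadership\n");
            return NO_TEST;  /* One-shot: the test disarms itself. */
        }
        return m;
    }

    int
    main(void)
    {
        enum ft_mode m = TRANSFER_AFTER_APPEND_REQ;

        m = next_raft_run(m);             /* Not armed yet: no-op.     */
        m = after_append_request_sent(m); /* Command sent, test armed. */
        m = next_raft_run(m);             /* Leadership transferred.   */
        return 0;
    }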
> @@ -2624,7 +2643,13 @@ raft_become_follower(struct raft *raft)
>       * configuration is already part of the log.  Possibly the configuration
>       * log entry will not be committed, but until we know that we must use the
>       * new configuration.  Our AppendEntries processing will properly update
> -     * the server configuration later, if necessary. */
> +     * the server configuration later, if necessary.
> +     *
> +     * Also we do not complete commands here, as they can still be completed
> +     * if their log entries have already been replicated to other servers.
> +     * If the entries were actually committed according to the new leader, our
> +     * AppendEntries processing will complete the corresponding commands.
> +     */
>      struct raft_server *s;
>      HMAP_FOR_EACH (s, hmap_node, &raft->add_servers) {
>          raft_send_add_server_reply__(raft, &s->sid, s->address, false,
> @@ -2638,8 +2663,6 @@ raft_become_follower(struct raft *raft)
>          raft_server_destroy(raft->remove_server);
>          raft->remove_server = NULL;
>      }
> -
> -    raft_complete_all_commands(raft, RAFT_CMD_LOST_LEADERSHIP);
>  }
>
>  static void
> @@ -2888,61 +2911,56 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
>          return false;
>      }
>
> -    if (raft->role == RAFT_LEADER) {
> -        while (raft->commit_index < new_commit_index) {
> -            uint64_t index = ++raft->commit_index;
> -            const struct raft_entry *e = raft_get_entry(raft, index);
> -            if (raft_entry_has_data(e)) {
> -                struct raft_command *cmd
> -                    = raft_find_command_by_eid(raft, &e->eid);
> -                if (cmd) {
> -                    if (!cmd->index) {
> -                        VLOG_DBG("Command completed after role change from"
> -                                 " follower to leader "UUID_FMT,
> -                                 UUID_ARGS(&e->eid));
> -                        cmd->index = index;
> -                    }
> -                    raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
> +    while (raft->commit_index < new_commit_index) {
> +        uint64_t index = ++raft->commit_index;
> +        const struct raft_entry *e = raft_get_entry(raft, index);
> +
> +        if (raft_entry_has_data(e)) {
> +            struct raft_command *cmd = raft_find_command_by_eid(raft, &e->eid);
> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
> +
> +            if (cmd) {
> +                if (!cmd->index && raft->role == RAFT_LEADER) {
> +                    VLOG_INFO_RL(&rl,
> +                        "command completed after role change from "
> +                        "follower to leader (eid: "UUID_FMT", "
> +                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
> +                } else if (!cmd->index && raft->role != RAFT_LEADER) {
> +                    /* This can happen when leader fail-over before sending
> +                     * execute_command_reply. */
> +                    VLOG_INFO_RL(&rl,
> +                        "command completed without reply (eid: "UUID_FMT", "
> +                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
> +                } else if (cmd->index && raft->role != RAFT_LEADER) {
> +                    /* This can happen if current server lost leadership after
> +                     * sending append requests to the majority of servers, but
> +                     * before receiving majority of append replies. */
> +                    VLOG_INFO_RL(&rl,
> +                        "command completed after role change from "
> +                        "leader to follower (eid: "UUID_FMT", "
> +                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
> +                    /* Clearing 'sid' to avoid sending cmd execution reply. */
> +                    cmd->sid = UUID_ZERO;
> +                } else {
> +                    /* (cmd->index && raft->role == RAFT_LEADER)
> +                     * Normal command completion on a leader. */
>                  }
> -            }
> -            if (e->election_timer) {
> -                VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
> -                          raft->election_timer, e->election_timer);
> -                raft->election_timer = e->election_timer;
> -                raft->election_timer_new = 0;
> -                raft_update_probe_intervals(raft);
> -            }
> -            if (e->servers) {
> -                /* raft_run_reconfigure() can write a new Raft entry, which can
> -                 * reallocate raft->entries, which would invalidate 'e', so
> -                 * this case must be last, after the one for 'e->data'. */
> -                raft_run_reconfigure(raft);
> +                cmd->index = index;
> +                raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
>              }
>          }
> -    } else {
> -        while (raft->commit_index < new_commit_index) {
> -            uint64_t index = ++raft->commit_index;
> -            const struct raft_entry *e = raft_get_entry(raft, index);
> -            if (e->election_timer) {
> -                VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
> -                          raft->election_timer, e->election_timer);
> -                raft->election_timer = e->election_timer;
> -                raft_update_probe_intervals(raft);
> -            }
> +        if (e->election_timer) {
> +            VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
> +                      raft->election_timer, e->election_timer);
> +            raft->election_timer = e->election_timer;
> +            raft->election_timer_new = 0;
> +            raft_update_probe_intervals(raft);
>          }
> -        /* Check if any pending command can be completed, and complete it.
> -         * This can happen when leader fail-over before sending
> -         * execute_command_reply. */
> -        const struct uuid *eid = raft_get_eid(raft, new_commit_index);
> -        struct raft_command *cmd = raft_find_command_by_eid(raft, eid);
> -        if (cmd) {
> -            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
> -            VLOG_INFO_RL(&rl,
> -                         "Command completed without reply (eid: "UUID_FMT", "
> -                         "commit index: %"PRIu64")",
> -                         UUID_ARGS(eid), new_commit_index);
> -            cmd->index = new_commit_index;
> -            raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
> +        if (e->servers && raft->role == RAFT_LEADER) {
> +            /* raft_run_reconfigure() can write a new Raft entry, which can
> +             * reallocate raft->entries, which would invalidate 'e', so
> +             * this case must be last, after the one for 'e->data'. */
> +            raft_run_reconfigure(raft);
>          }
>      }
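To summarize the four cases that the unified loop above distinguishes,
here is a compact, compilable decision table.  'has_index' and
'is_leader' are simplified stand-ins for 'cmd->index != 0' and
'raft->role == RAFT_LEADER'; the real code additionally clears 'sid' in
the leader-to-follower case to suppress the execute_command_reply:

    #include <stdbool.h>
    #include <stdio.h>

    /* Decision table mirroring the unified command-completion loop. */
    static const char *
    completion_case(bool has_index, bool is_leader)
    {
        if (!has_index && is_leader) {
            return "completed after role change from follower to leader";
        } else if (!has_index && !is_leader) {
            return "completed without reply (leader failed over before "
                   "sending execute_command_reply)";
        } else if (has_index && !is_leader) {
            return "completed after role change from leader to follower "
                   "(execute_command_reply suppressed)";
        } else {
            return "normal command completion on a leader";
        }
    }

    int
    main(void)
    {
        for (int i = 0; i < 4; i++) {
            bool has_index = (i & 1) != 0;
            bool is_leader = (i & 2) != 0;

            printf("index=%d leader=%d: %s\n", (int) has_index,
                   (int) is_leader, completion_case(has_index, is_leader));
        }
        return 0;
    }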
> @@ -4975,6 +4993,11 @@ raft_unixctl_failure_test(struct unixctl_conn *conn OVS_UNUSED,
>          failure_test = FT_DONT_SEND_VOTE_REQUEST;
>      } else if (!strcmp(test, "stop-raft-rpc")) {
>          failure_test = FT_STOP_RAFT_RPC;
> +    } else if (!strcmp(test,
> +                       "transfer-leadership-after-sending-append-request")) {
> +        failure_test = FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ;
> +    } else if (!strcmp(test, "transfer-leadership")) {
> +        failure_test = FT_TRANSFER_LEADERSHIP;
>      } else if (!strcmp(test, "clear")) {
>          failure_test = FT_NO_TEST;
>          unixctl_command_reply(conn, "test dismissed");
> diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> index ee9c7b937..0f7076a05 100644
> --- a/tests/ovsdb-cluster.at
> +++ b/tests/ovsdb-cluster.at
> @@ -468,6 +468,7 @@ ovsdb_cluster_failure_test () {
>      if test "$crash_node" == "1"; then
>          new_leader=$5
>      fi
> +    log_grep=$6
>
>      cp $top_srcdir/vswitchd/vswitch.ovsschema schema
>      schema=`ovsdb-tool schema-name schema`
> @@ -488,7 +489,7 @@ ovsdb_cluster_failure_test () {
>      start_server() {
>          local i=$1
>          printf "\ns$i: starting\n"
> -        AT_CHECK([ovsdb-server -vjsonrpc -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
> +        AT_CHECK([ovsdb-server -vjsonrpc -vraft -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
>      }
>      connect_server() {
>          local i=$1
> @@ -514,14 +515,23 @@ ovsdb_cluster_failure_test () {
>          fi
>          AT_CHECK([ovs-appctl -t "`pwd`"/s$delay_election_node cluster/failure-test delay-election], [0], [ignore])
>      fi
> +
> +    # Initializing the database separately to avoid extra 'wait' operation
> +    # in later transactions.
> +    AT_CHECK([ovs-vsctl -v --db="$db" --no-leader-only --no-shuffle-remotes --no-wait init], [0], [ignore], [ignore])
> +
>      AT_CHECK([ovs-appctl -t "`pwd`"/s$crash_node cluster/failure-test $crash_command], [0], [ignore])
>      AT_CHECK([ovs-vsctl -v --db="$db" --no-leader-only --no-shuffle-remotes --no-wait create QoS type=x], [0], [ignore], [ignore])
>
> -    # Make sure that the node really crashed.
> -    AT_CHECK([ls s$crash_node.ovsdb], [2], [ignore], [ignore])
> -    # XXX: Client will fail if remotes contains unix socket that doesn't exist (killed).
> -    if test "$remote_1" = "$crash_node"; then
> -        db=unix:s$remote_2.ovsdb
> +    # Make sure that the node really crashed or has specific log message.
> +    if test -z "$log_grep"; then
> +        AT_CHECK([ls s$crash_node.ovsdb], [2], [ignore], [ignore])
> +        # XXX: Client will fail if remotes contains unix socket that doesn't exist (killed).
> +        if test "$remote_1" = "$crash_node"; then
> +            db=unix:s$remote_2.ovsdb
> +        fi
> +    else
> +        OVS_WAIT_UNTIL([grep -q "$log_grep" s${crash_node}.log])
>      fi
>      AT_CHECK([ovs-vsctl --db="$db" --no-leader-only --no-wait --columns=type --bare list QoS], [0], [x
> ])
> @@ -617,6 +627,11 @@ AT_KEYWORDS([ovsdb server negative unix cluster pending-txn])
>  ovsdb_cluster_failure_test 2 2 3 crash-after-receiving-append-request-update
>  AT_CLEANUP
>
> +AT_SETUP([OVSDB cluster - txn on leader, leader transfers leadership after sending appendReq])
> +AT_KEYWORDS([ovsdb server negative unix cluster pending-txn transfer])
> +ovsdb_cluster_failure_test 1 2 1 transfer-leadership-after-sending-append-request -1 "Transferring leadership"
> +AT_CLEANUP
> +
>
>  AT_SETUP([OVSDB cluster - competing candidates])
>  AT_KEYWORDS([ovsdb server negative unix cluster competing-candidates])
> --
> 2.34.3
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
