On 7/1/23 04:43, Han Zhou wrote: > > > On Fri, Jun 30, 2023 at 4:30 PM Ilya Maximets <[email protected] > <mailto:[email protected]>> wrote: >> >> On 6/25/23 19:35, Han Zhou wrote: >> > When a server becomes unstable due to overloading or intermittent >> > partitioning, it may miss some heartbeats and then starts election with >> > a new term, which would disrupt the otherwise healthy cluster formed by >> > the rest of the healthy nodes. Such situation may exist for a long time >> > until the "flapping" server is shutdown or recovered completely, which >> > can severely impact the availability of the cluster. The pre-vote >> > mechanism introduced in the raft paper section 9.6 can prevent such >> > problems. This patch implements the pre-vote mechanism, in a way that is >> > backward compatible and safe for upgrades. >> >> Thanks, Han! This is definitely interesting. I don't think that this >> should be a solution for overloading though, overloading should be fixed >> by, well, reducing the load or increasing the election timer. But some >> intermittent network issues is indeed a valid case. > > Sorry that I wasn't clear about "overloading". I was referring to the case > when the machine is getting overloaded, not necessarily because of the > ovsdb-server process but because of other applications/VMs sharing the same > physical resource, which is likely what we have hit in the production. > You may argue it is an improper resource allocation, but from OVSDB point of > view, it is not an excuse for the whole RAFT cluster down. The pre-vote can > avoid such disaster, regardless of the factors of the problematic server that > we can't control.
OK. Makes sense. > >> >> I looked through the code and I don't see any significant problems right >> away, but I would like to experiment with it a bit more next week, since >> the voting logic is fairly complex in general. >> > Appreciated! > >> It's nice to see you found a use case for the stop-raft-rpc failure >> test. :) >> >> See some comments inline. >> >> Best regards, Ilya Maximets. >> >> > >> > Signed-off-by: Han Zhou <[email protected] <mailto:[email protected]>> >> > --- >> > ovsdb/raft-rpc.c | 19 ++++++++- >> > ovsdb/raft-rpc.h | 3 ++ >> > ovsdb/raft.c | 88 ++++++++++++++++++++++++++++++------------ >> > tests/ovsdb-cluster.at <http://ovsdb-cluster.at> | 42 ++++++++++++++++++++ >> > 4 files changed, 127 insertions(+), 25 deletions(-) >> > >> > diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c >> > index dd14d81091fc..f750ba897046 100644 >> > --- a/ovsdb/raft-rpc.c >> > +++ b/ovsdb/raft-rpc.c >> > @@ -283,6 +283,10 @@ raft_vote_request_to_jsonrpc(const struct >> > raft_vote_request *rq, >> > json_object_put(args, "leadership_transfer", >> > json_boolean_create(true)); >> > } >> > + if (rq->is_prevote) { >> > + json_object_put(args, "is_prevote", >> > + json_boolean_create(true)); >> > + } Hmm. I don't think this is fully backward compatible. If the other server is the old one, the ovsdb_parser_finish() in the raft_rpc_from_jsonrpc() will report an error, because not all the members of the json message were consumed. And hence the pre-vote will never succeed in a cluster where upgraded servers do not have a quorum. Let's say we have a cluster of 5. The leader + 2 old servers + 2 new. Leader goes down. 2 new servers don't have a quorum and old ones do not reply to pre-vote requests. I suppose, one of the old ones will have to initiate an election then, and the new ones will reply to the actual vote. So, one of the old servers can become a leader, but new servers can not. This is probably not an issue since the cluster can still elect a leader. What do you think? Maybe a clarification on that will be useful in the commit message. >> > } >> > >> > static void >> > @@ -294,6 +298,8 @@ raft_vote_request_from_jsonrpc(struct ovsdb_parser *p, >> > rq->last_log_term = raft_parse_required_uint64(p, "last_log_term"); >> > rq->leadership_transfer >> > = raft_parse_optional_boolean(p, "leadership_transfer") == 1; >> > + rq->is_prevote >> > + = raft_parse_optional_boolean(p, "is_prevote") == 1; >> > } >> > >> > static void >> > @@ -305,6 +311,9 @@ raft_format_vote_request(const struct >> > raft_vote_request *rq, struct ds *s) >> > if (rq->leadership_transfer) { >> > ds_put_cstr(s, " leadership_transfer=true"); >> > } >> > + if (rq->is_prevote) { >> > + ds_put_cstr(s, " is_prevote=true"); >> > + } >> > } >> > >> > /* raft_vote_reply. */ >> > @@ -326,6 +335,9 @@ raft_vote_reply_to_jsonrpc(const struct >> > raft_vote_reply *rpy, >> > { >> > raft_put_uint64(args, "term", rpy->term); >> > json_object_put_format(args, "vote", UUID_FMT, UUID_ARGS(&rpy->vote)); >> > + if (rpy->is_prevote) { >> > + json_object_put(args, "is_prevote", json_boolean_create(true)); >> > + } >> > } >> > >> > static void >> > @@ -334,6 +346,7 @@ raft_vote_reply_from_jsonrpc(struct ovsdb_parser *p, >> > { >> > rpy->term = raft_parse_required_uint64(p, "term"); >> > rpy->vote = raft_parse_required_uuid(p, "vote"); >> > + rpy->is_prevote = raft_parse_optional_boolean(p, "is_prevote") == 1; >> > } >> > >> > static void >> > @@ -341,6 +354,9 @@ raft_format_vote_reply(const struct raft_vote_reply >> > *rpy, struct ds *s) >> > { >> > ds_put_format(s, " term=%"PRIu64, rpy->term); >> > ds_put_format(s, " vote="SID_FMT, SID_ARGS(&rpy->vote)); >> > + if (rpy->is_prevote) { >> > + ds_put_cstr(s, " is_prevote=true"); >> > + } >> > } >> > >> > /* raft_add_server_request */ >> > @@ -1008,7 +1024,8 @@ raft_rpc_get_vote(const union raft_rpc *rpc) >> > return NULL; >> > >> > case RAFT_RPC_VOTE_REPLY: >> > - return &raft_vote_reply_cast(rpc)->vote; >> > + return raft_vote_reply_cast(rpc)->is_prevote ? NULL : >> > + &raft_vote_reply_cast(rpc)->vote; >> >> It might be better to create a pointer variable here and access it >> instead of casting twice. >> > That was my first attempt, but I can't define a local variable under the > 'case' statement unless I add a pair of '{}' for the whole block, which looks > ugly :( > I will try to find if there are similar patterns in the code base. We have a few places with braces around the case body, it should be fine. You may also define a pointer at the top of the function and assign it in the case. RAFT_RPC_VOTE_REPLY is the main use-case for this function, so having a pointer for it declared at the top should be fine as well. > >> > >> > default: >> > OVS_NOT_REACHED(); >> > diff --git a/ovsdb/raft-rpc.h b/ovsdb/raft-rpc.h >> > index 221f24d00128..10f30618e3e0 100644 >> > --- a/ovsdb/raft-rpc.h >> > +++ b/ovsdb/raft-rpc.h >> > @@ -125,12 +125,15 @@ struct raft_vote_request { >> > uint64_t last_log_index; /* Index of candidate's last log entry. */ >> > uint64_t last_log_term; /* Term of candidate's last log entry. */ >> > bool leadership_transfer; /* True to override minimum election >> > timeout. */ >> > + bool is_prevote; /* True: pre-vote; False: real vote >> > (default). */ >> > }; >> > >> > struct raft_vote_reply { >> > struct raft_rpc_common common; >> > uint64_t term; /* Current term, for candidate to update >> > itself. */ >> > struct uuid vote; /* Server ID of vote. */ >> > + bool is_prevote; /* Copy the is_prevote from the request, >> > primarily >> > + for validation. */ >> >> Each comment line should start with a '*'. >> >> "Copy of the ..." ? >> > > Ack. > >> > }; >> > >> > struct raft_add_server_request { >> > diff --git a/ovsdb/raft.c b/ovsdb/raft.c >> > index b2361b1737a2..4d7a3f112cad 100644 >> > --- a/ovsdb/raft.c >> > +++ b/ovsdb/raft.c >> > @@ -305,6 +305,11 @@ struct raft { >> > >> > /* Candidates only. Reinitialized at start of election. */ >> > int n_votes; /* Number of votes for me. */ >> > + bool prevote_passed; /* Indicates if it passed pre-vote phase. >> > + Pre-vote mechanism is introduced in >> > raft >> > + paper section 9.6. We implement it as a >> > + sub-state of candidate to minize the >> > change >> > + and keep backward compatibility. */ >> >> Each comment line should start with a '*'. >> >> And s/minize/minimize/. >> > Ack. > >> > >> > /* Followers and candidates only. */ >> > bool candidate_retrying; /* The earlier election timed-out and we >> > are >> > @@ -372,7 +377,8 @@ static void raft_become_follower(struct raft *); >> > static void raft_reset_election_timer(struct raft *); >> > static void raft_reset_ping_timer(struct raft *); >> > static void raft_send_heartbeats(struct raft *); >> > -static void raft_start_election(struct raft *, bool leadership_transfer); >> > +static void raft_start_election(struct raft *, bool is_prevote, >> > + bool leadership_transfer); >> > static bool raft_truncate(struct raft *, uint64_t new_end); >> > static void raft_get_servers_from_log(struct raft *, enum vlog_level); >> > static void raft_get_election_timer_from_log(struct raft *); >> > @@ -1069,7 +1075,8 @@ raft_open(struct ovsdb_log *log, struct raft **raftp) >> > /* If there's only one server, start an election right away so >> > that the >> > * cluster bootstraps quickly. */ >> > if (hmap_count(&raft->servers) == 1) { >> > - raft_start_election(raft, false); >> > + /* No pre-vote needed since we are the only one. */ >> > + raft_start_election(raft, false, false); >> > } >> > } else { >> > raft->join_timeout = time_msec() + 1000; >> > @@ -1360,7 +1367,7 @@ void >> > raft_take_leadership(struct raft *raft) >> > { >> > if (raft->role != RAFT_LEADER) { >> > - raft_start_election(raft, true); >> > + raft_start_election(raft, false, true); >> > } >> > } >> > >> > @@ -1766,12 +1773,12 @@ raft_set_term(struct raft *raft, uint64_t term, >> > const struct uuid *vote) >> > return true; >> > } >> > >> > -static void >> > +static bool >> > raft_accept_vote(struct raft *raft, struct raft_server *s, >> > const struct uuid *vote) >> > { >> > if (uuid_equals(&s->vote, vote)) { >> > - return; >> > + return false; >> > } >> > if (!uuid_is_zero(&s->vote)) { >> > static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1); >> > @@ -1785,13 +1792,18 @@ raft_accept_vote(struct raft *raft, struct >> > raft_server *s, >> > s->vote = *vote; >> > if (uuid_equals(vote, &raft->sid) >> > && ++raft->n_votes > hmap_count(&raft->servers) / 2) { >> > - raft_become_leader(raft); >> > + return true; >> > } >> > + return false; >> > } >> > >> > static void >> > -raft_start_election(struct raft *raft, bool leadership_transfer) >> > +raft_start_election(struct raft *raft, bool is_prevote, >> > + bool leadership_transfer) >> > { >> > + /* Leadership transfer doesn't use prevote. */ >> > + ovs_assert(!is_prevote || !leadership_transfer); >> > + >> > if (raft->leaving) { >> > return; >> > } >> > @@ -1801,7 +1813,7 @@ raft_start_election(struct raft *raft, bool >> > leadership_transfer) >> > return; >> > } >> > >> > - if (!raft_set_term(raft, raft->term + 1, &raft->sid)) { >> > + if (!is_prevote && !raft_set_term(raft, raft->term + 1, &raft->sid)) { >> > return; >> > } >> > >> > @@ -1809,26 +1821,33 @@ raft_start_election(struct raft *raft, bool >> > leadership_transfer) >> > >> > raft->leader_sid = UUID_ZERO; >> > raft->role = RAFT_CANDIDATE; >> > - /* If there was no leader elected since last election, we know we are >> > - * retrying now. */ >> > - raft->candidate_retrying = !raft->had_leader; >> > - raft->had_leader = false; >> > + raft->prevote_passed = !is_prevote; >> > + >> > + if (is_prevote || leadership_transfer) { >> > + /* If there was no leader elected since last election, we know we >> > are >> > + * retrying now. */ >> > + raft->candidate_retrying = !raft->had_leader; >> > + raft->had_leader = false; >> > + >> > + raft->election_start = time_msec(); >> > + raft->election_won = 0; >> > + } >> > >> > raft->n_votes = 0; >> > >> > - raft->election_start = time_msec(); >> > - raft->election_won = 0; >> > raft->leadership_transfer = leadership_transfer; >> > >> > static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5); >> > if (!VLOG_DROP_INFO(&rl)) { >> > long long int now = time_msec(); >> > + char *comment = is_prevote ? "prevote" : "vote"; >> >> Should user-visible logs say "pre-vote" instead of "prevote" ? > > I am ok either way. I was thinking about the situation when searching the > logs with \<vote\> in vim, but it shouldn't matter much :) I can change it if > pre-vote looks better to users. It was just a bit confusing that all the comments use 'pre-vote', but the log entry is using 'prevote'. An alternative might be to use "starting election" for an actual vote and "starting preliminary election" for a pre-vote. We may also use 'preliminary': true/false, instead of "is_prevote". Might be more user-friendly. 'prevote_passed' may be turned into 'preliminary_vote_passed'. It's on the edge of being too long of a name, not sure what would be a better one though. What do you think? >> >> > if (now >= raft->election_timeout) { >> > VLOG_INFO("term %"PRIu64": %lld ms timeout expired, " >> > - "starting election", >> > - raft->term, now - raft->election_base); >> > + "starting election (%s)", >> > + raft->term, now - raft->election_base, comment); >> > } else { >> > - VLOG_INFO("term %"PRIu64": starting election", raft->term); >> > + VLOG_INFO("term %"PRIu64": starting election (%s)", >> > + raft->term, comment); >> > } >> > } >> > raft_reset_election_timer(raft); >> > @@ -1853,6 +1872,7 @@ raft_start_election(struct raft *raft, bool >> > leadership_transfer) >> > ? raft->entries[raft->log_end - raft->log_start - >> > 1].term >> > : raft->snap.term), >> > .leadership_transfer = leadership_transfer, >> > + .is_prevote = is_prevote, >> > }, >> > }; >> > if (failure_test != FT_DONT_SEND_VOTE_REQUEST) { >> > @@ -1861,7 +1881,13 @@ raft_start_election(struct raft *raft, bool >> > leadership_transfer) >> > } >> > >> > /* Vote for ourselves. */ >> > - raft_accept_vote(raft, me, &raft->sid); >> > + if (raft_accept_vote(raft, me, &raft->sid)) { >> > + /* We just started vote, so it shouldn't be accepted yet unless >> > this is >> > + * a one-node cluster. In such case we don't do pre-vote, and >> > become >> > + * leader immediately. */ >> > + ovs_assert(!is_prevote); >> > + raft_become_leader(raft); >> > + } >> > } >> > >> > static void >> > @@ -2041,10 +2067,10 @@ raft_run(struct raft *raft) >> > raft_reset_election_timer(raft); >> > } else { >> > raft_become_follower(raft); >> > - raft_start_election(raft, false); >> > + raft_start_election(raft, true, false); >> > } >> > } else { >> > - raft_start_election(raft, false); >> > + raft_start_election(raft, true, false); >> > } >> > >> > } >> > @@ -3673,6 +3699,10 @@ raft_handle_vote_request__(struct raft *raft, >> > return false; >> > } >> > >> > + if (rq->is_prevote) { >> > + return true; >> > + } >> > + >> > /* Record a vote for the peer. */ >> > if (!raft_set_term(raft, raft->term, &rq->common.sid)) { >> > return false; >> > @@ -3685,7 +3715,7 @@ raft_handle_vote_request__(struct raft *raft, >> > >> > static void >> > raft_send_vote_reply(struct raft *raft, const struct uuid *dst, >> > - const struct uuid *vote) >> > + const struct uuid *vote, bool is_prevote) >> > { >> > union raft_rpc rpy = { >> > .vote_reply = { >> > @@ -3695,6 +3725,7 @@ raft_send_vote_reply(struct raft *raft, const struct >> > uuid *dst, >> > }, >> > .term = raft->term, >> > .vote = *vote, >> > + .is_prevote = is_prevote, >> > }, >> > }; >> > raft_send(raft, &rpy); >> > @@ -3705,7 +3736,9 @@ raft_handle_vote_request(struct raft *raft, >> > const struct raft_vote_request *rq) >> > { >> > if (raft_handle_vote_request__(raft, rq)) { >> > - raft_send_vote_reply(raft, &rq->common.sid, &raft->vote); >> > + raft_send_vote_reply(raft, &rq->common.sid, >> > + rq->is_prevote ? &rq->common.sid : >> > &raft->vote, >> > + rq->is_prevote); >> > } >> > } >> > >> > @@ -3723,7 +3756,14 @@ raft_handle_vote_reply(struct raft *raft, >> > >> > struct raft_server *s = raft_find_peer(raft, &rpy->common.sid); >> > if (s) { >> > - raft_accept_vote(raft, s, &rpy->vote); >> > + if (raft_accept_vote(raft, s, &rpy->vote)) { >> >> IIUC, the raft_accept_vote() here can't succeed for pre-vote because >> everyone votes for themselves on pre-vote. Is that correct? >> > pre-vote happens the same way as the real vote, except that pre-vote doesn't > increase the term. > This function is for the candidate to handle a vote-reply from a peer for a > vote/pre-vote request sent by itself. > So, raft_accept_vote should succeed in normal cases, for both pre-vote and > vote. > If it is pre-vote and it is successful, it starts the real vote immediately. > >> What happens if we received a real vote during a pre-vote phase for >> some reason? Will this logic fail? Is this scenario possible? > > The logic of handling the vote/pre-vote reply message only depends on the > status of the server itself, regardless of the message's is_prevote field. So > we don't really need to care if it is a pre-vote reply or vote reply. The > is_prevote field in the reply was added only for validation purpose, not > useful for the RPC itself, as mentioned in the comments. Oh, thanks for explanation, I misread the raft_handle_vote_request() the first time. I'll take a closer look. > > Thanks, > Han >> >> > + if (raft->prevote_passed) { >> > + raft_become_leader(raft); >> > + } else { >> > + /* Start the real election. */ >> > + raft_start_election(raft, false, false); >> > + } >> > + } >> > } >> > } >> > >> > @@ -4357,7 +4397,7 @@ raft_handle_become_leader(struct raft *raft, >> > VLOG_INFO("received leadership transfer from %s in term %"PRIu64, >> > raft_get_nickname(raft, &rq->common.sid, buf, sizeof >> > buf), >> > rq->term); >> > - raft_start_election(raft, true); >> > + raft_start_election(raft, false, true); >> > } >> > } >> > >> > diff --git a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> >> > b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> >> > index 9fbf5dc897f2..8a81136999e0 100644 >> > --- a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> >> > +++ b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> >> > @@ -715,6 +715,48 @@ done >> > >> > AT_CLEANUP >> > >> > + >> > +AT_SETUP([OVSDB cluster - disruptive server]) >> > +AT_KEYWORDS([ovsdb server negative unix cluster disruptive]) >> > + >> > +n=3 >> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema` >> > +ordinal_schema > schema >> > +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db >> > $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr]) >> > +cid=`ovsdb-tool db-cid s1.db` >> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema` >> > +for i in `seq 2 $n`; do >> > + AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft >> > unix:s1.raft]) >> > +done >> > + >> > +on_exit 'kill `cat *.pid`' >> > +for i in `seq $n`; do >> > + AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach >> > --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i >> > --remote=punix:s$i.ovsdb s$i.db]) >> > +done >> > +for i in `seq $n`; do >> > + AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected]) >> > +done >> > + >> > +# An unstable follower shouldn't disrupt the healthy cluster - shouldn't >> > +# trigger term change. >> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test stop-raft-rpc], >> > [0], [ignore]) >> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | >> > grep "Role: candidate"]) >> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test clear], [0], >> > [ignore]) >> > + >> > +# Should step back to follower. >> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | >> > grep "Role: follower"]) >> > + >> > +# No term change. >> > +for i in `seq $n`; do >> > + AT_CHECK([ovs-appctl -t "`pwd`"/s$i cluster/status $schema_name | >> > grep "Term: 1"], [0], [ignore]) >> > +done >> > + >> > +for i in `seq $n`; do >> > + OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid]) >> > +done >> > + >> > +AT_CLEANUP >> > + >> > >> > AT_BANNER([OVSDB - cluster tests]) >> > >> _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
