On 7/1/23 04:43, Han Zhou wrote:
> 
> 
> On Fri, Jun 30, 2023 at 4:30 PM Ilya Maximets <[email protected] 
> <mailto:[email protected]>> wrote:
>>
>> On 6/25/23 19:35, Han Zhou wrote:
>> > When a server becomes unstable due to overloading or intermittent
>> > partitioning, it may miss some heartbeats and then starts election with
>> > a new term, which would disrupt the otherwise healthy cluster formed by
>> > the rest of the healthy nodes.  Such situation may exist for a long time
>> > until the "flapping" server is shutdown or recovered completely, which
>> > can severely impact the availability of the cluster. The pre-vote
>> > mechanism introduced in the raft paper section 9.6 can prevent such
>> > problems. This patch implements the pre-vote mechanism, in a way that is
>> > backward compatible and safe for upgrades.
>>
>> Thanks, Han!  This is definitely interesting.  I don't think that this
>> should be a solution for overloading though, overloading should be fixed
>> by, well, reducing the load or increasing the election timer.  But some
>> intermittent network issues is indeed a valid case.
> 
> Sorry that I wasn't clear about "overloading". I was referring to the case 
> when the machine is getting overloaded, not necessarily because of the 
> ovsdb-server process but because of other applications/VMs sharing the same 
> physical resource, which is likely what we have hit in the production.
> You may argue it is an improper resource allocation, but from OVSDB point of 
> view, it is not an excuse for the whole RAFT cluster down. The pre-vote can 
> avoid such disaster, regardless of the factors of the problematic server that 
> we can't control.

OK.  Makes sense.

> 
>>
>> I looked through the code and I don't see any significant problems right
>> away, but I would like to experiment with it a bit more next week, since
>> the voting logic is fairly complex in general.
>>
> Appreciated!
> 
>> It's nice to see you found a use case for the stop-raft-rpc failure
>> test. :)
>>
>> See some comments inline.
>>
>> Best regards, Ilya Maximets.
>>
>> >
>> > Signed-off-by: Han Zhou <[email protected] <mailto:[email protected]>>
>> > ---
>> >  ovsdb/raft-rpc.c       | 19 ++++++++-
>> >  ovsdb/raft-rpc.h       |  3 ++
>> >  ovsdb/raft.c           | 88 ++++++++++++++++++++++++++++++------------
>> >  tests/ovsdb-cluster.at <http://ovsdb-cluster.at> | 42 ++++++++++++++++++++
>> >  4 files changed, 127 insertions(+), 25 deletions(-)
>> >
>> > diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c
>> > index dd14d81091fc..f750ba897046 100644
>> > --- a/ovsdb/raft-rpc.c
>> > +++ b/ovsdb/raft-rpc.c
>> > @@ -283,6 +283,10 @@ raft_vote_request_to_jsonrpc(const struct 
>> > raft_vote_request *rq,
>> >          json_object_put(args, "leadership_transfer",
>> >                          json_boolean_create(true));
>> >      }
>> > +    if (rq->is_prevote) {
>> > +        json_object_put(args, "is_prevote",
>> > +                        json_boolean_create(true));
>> > +    }

Hmm.  I don't think this is fully backward compatible.  If the other
server is the old one, the ovsdb_parser_finish() in the
raft_rpc_from_jsonrpc() will report an error, because not all the members
of the json message were consumed.  And hence the pre-vote will never
succeed in a cluster where upgraded servers do not have a quorum.

Let's say we have a cluster of 5.  The leader + 2 old servers + 2 new.
Leader goes down.  2 new servers don't have a quorum and old ones do
not reply to pre-vote requests.  I suppose, one of the old ones will
have to initiate an election then, and the new ones will reply to the
actual vote.  So, one of the old servers can become a leader, but new
servers can not.  This is probably not an issue since the cluster can
still elect a leader.  What do you think?

Maybe a clarification on that will be useful in the commit message.


>> >  }
>> >
>> >  static void
>> > @@ -294,6 +298,8 @@ raft_vote_request_from_jsonrpc(struct ovsdb_parser *p,
>> >      rq->last_log_term = raft_parse_required_uint64(p, "last_log_term");
>> >      rq->leadership_transfer
>> >          = raft_parse_optional_boolean(p, "leadership_transfer") == 1;
>> > +    rq->is_prevote
>> > +        = raft_parse_optional_boolean(p, "is_prevote") == 1;
>> >  }
>> >
>> >  static void
>> > @@ -305,6 +311,9 @@ raft_format_vote_request(const struct 
>> > raft_vote_request *rq, struct ds *s)
>> >      if (rq->leadership_transfer) {
>> >          ds_put_cstr(s, " leadership_transfer=true");
>> >      }
>> > +    if (rq->is_prevote) {
>> > +        ds_put_cstr(s, " is_prevote=true");
>> > +    }
>> >  }
>> >
>> >  /* raft_vote_reply. */
>> > @@ -326,6 +335,9 @@ raft_vote_reply_to_jsonrpc(const struct 
>> > raft_vote_reply *rpy,
>> >  {
>> >      raft_put_uint64(args, "term", rpy->term);
>> >      json_object_put_format(args, "vote", UUID_FMT, UUID_ARGS(&rpy->vote));
>> > +    if (rpy->is_prevote) {
>> > +        json_object_put(args, "is_prevote", json_boolean_create(true));
>> > +    }
>> >  }
>> >
>> >  static void
>> > @@ -334,6 +346,7 @@ raft_vote_reply_from_jsonrpc(struct ovsdb_parser *p,
>> >  {
>> >      rpy->term = raft_parse_required_uint64(p, "term");
>> >      rpy->vote = raft_parse_required_uuid(p, "vote");
>> > +    rpy->is_prevote = raft_parse_optional_boolean(p, "is_prevote") == 1;
>> >  }
>> >
>> >  static void
>> > @@ -341,6 +354,9 @@ raft_format_vote_reply(const struct raft_vote_reply 
>> > *rpy, struct ds *s)
>> >  {
>> >      ds_put_format(s, " term=%"PRIu64, rpy->term);
>> >      ds_put_format(s, " vote="SID_FMT, SID_ARGS(&rpy->vote));
>> > +    if (rpy->is_prevote) {
>> > +        ds_put_cstr(s, " is_prevote=true");
>> > +    }
>> >  }
>> >
>> >  /* raft_add_server_request */
>> > @@ -1008,7 +1024,8 @@ raft_rpc_get_vote(const union raft_rpc *rpc)
>> >          return NULL;
>> >
>> >      case RAFT_RPC_VOTE_REPLY:
>> > -        return &raft_vote_reply_cast(rpc)->vote;
>> > +        return raft_vote_reply_cast(rpc)->is_prevote ? NULL :
>> > +            &raft_vote_reply_cast(rpc)->vote;
>>
>> It might be better to create a pointer variable here and access it
>> instead of casting twice.
>>
> That was my first attempt, but I can't define a local variable under the 
> 'case' statement unless I add a pair of '{}' for the whole block, which looks 
> ugly :(
> I will try to find if there are similar patterns in the code base.

We have a few places with braces around the case body, it should be
fine.  You may also define a pointer at the top of the function
and assign it in the case.  RAFT_RPC_VOTE_REPLY is the main use-case
for this function, so having a pointer for it declared at the top
should be fine as well.

> 
>> >
>> >      default:
>> >          OVS_NOT_REACHED();
>> > diff --git a/ovsdb/raft-rpc.h b/ovsdb/raft-rpc.h
>> > index 221f24d00128..10f30618e3e0 100644
>> > --- a/ovsdb/raft-rpc.h
>> > +++ b/ovsdb/raft-rpc.h
>> > @@ -125,12 +125,15 @@ struct raft_vote_request {
>> >      uint64_t last_log_index; /* Index of candidate's last log entry. */
>> >      uint64_t last_log_term;  /* Term of candidate's last log entry. */
>> >      bool leadership_transfer;  /* True to override minimum election 
>> > timeout. */
>> > +    bool is_prevote;         /* True: pre-vote; False: real vote 
>> > (default). */
>> >  };
>> >
>> >  struct raft_vote_reply {
>> >      struct raft_rpc_common common;
>> >      uint64_t term;          /* Current term, for candidate to update 
>> > itself. */
>> >      struct uuid vote;       /* Server ID of vote. */
>> > +    bool is_prevote;        /* Copy the is_prevote from the request, 
>> > primarily
>> > +                               for validation. */
>>
>> Each comment line should start with a '*'.
>>
>> "Copy of the ..." ?
>>
> 
> Ack.
> 
>> >  };
>> >
>> >  struct raft_add_server_request {
>> > diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> > index b2361b1737a2..4d7a3f112cad 100644
>> > --- a/ovsdb/raft.c
>> > +++ b/ovsdb/raft.c
>> > @@ -305,6 +305,11 @@ struct raft {
>> >
>> >      /* Candidates only.  Reinitialized at start of election. */
>> >      int n_votes;                /* Number of votes for me. */
>> > +    bool prevote_passed;        /* Indicates if it passed pre-vote phase.
>> > +                                   Pre-vote mechanism is introduced in 
>> > raft
>> > +                                   paper section 9.6. We implement it as a
>> > +                                   sub-state of candidate to minize the 
>> > change
>> > +                                   and keep backward compatibility. */
>>
>> Each comment line should start with a '*'.
>>
>> And s/minize/minimize/.
>>
> Ack.
> 
>> >
>> >      /* Followers and candidates only. */
>> >      bool candidate_retrying;    /* The earlier election timed-out and we 
>> > are
>> > @@ -372,7 +377,8 @@ static void raft_become_follower(struct raft *);
>> >  static void raft_reset_election_timer(struct raft *);
>> >  static void raft_reset_ping_timer(struct raft *);
>> >  static void raft_send_heartbeats(struct raft *);
>> > -static void raft_start_election(struct raft *, bool leadership_transfer);
>> > +static void raft_start_election(struct raft *, bool is_prevote,
>> > +                                bool leadership_transfer);
>> >  static bool raft_truncate(struct raft *, uint64_t new_end);
>> >  static void raft_get_servers_from_log(struct raft *, enum vlog_level);
>> >  static void raft_get_election_timer_from_log(struct raft *);
>> > @@ -1069,7 +1075,8 @@ raft_open(struct ovsdb_log *log, struct raft **raftp)
>> >          /* If there's only one server, start an election right away so 
>> > that the
>> >           * cluster bootstraps quickly. */
>> >          if (hmap_count(&raft->servers) == 1) {
>> > -            raft_start_election(raft, false);
>> > +            /* No pre-vote needed since we are the only one. */
>> > +            raft_start_election(raft, false, false);
>> >          }
>> >      } else {
>> >          raft->join_timeout = time_msec() + 1000;
>> > @@ -1360,7 +1367,7 @@ void
>> >  raft_take_leadership(struct raft *raft)
>> >  {
>> >      if (raft->role != RAFT_LEADER) {
>> > -        raft_start_election(raft, true);
>> > +        raft_start_election(raft, false, true);
>> >      }
>> >  }
>> >
>> > @@ -1766,12 +1773,12 @@ raft_set_term(struct raft *raft, uint64_t term, 
>> > const struct uuid *vote)
>> >      return true;
>> >  }
>> >
>> > -static void
>> > +static bool
>> >  raft_accept_vote(struct raft *raft, struct raft_server *s,
>> >                   const struct uuid *vote)
>> >  {
>> >      if (uuid_equals(&s->vote, vote)) {
>> > -        return;
>> > +        return false;
>> >      }
>> >      if (!uuid_is_zero(&s->vote)) {
>> >          static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>> > @@ -1785,13 +1792,18 @@ raft_accept_vote(struct raft *raft, struct 
>> > raft_server *s,
>> >      s->vote = *vote;
>> >      if (uuid_equals(vote, &raft->sid)
>> >          && ++raft->n_votes > hmap_count(&raft->servers) / 2) {
>> > -        raft_become_leader(raft);
>> > +        return true;
>> >      }
>> > +    return false;
>> >  }
>> >
>> >  static void
>> > -raft_start_election(struct raft *raft, bool leadership_transfer)
>> > +raft_start_election(struct raft *raft, bool is_prevote,
>> > +                    bool leadership_transfer)
>> >  {
>> > +    /* Leadership transfer doesn't use prevote. */
>> > +    ovs_assert(!is_prevote || !leadership_transfer);
>> > +
>> >      if (raft->leaving) {
>> >          return;
>> >      }
>> > @@ -1801,7 +1813,7 @@ raft_start_election(struct raft *raft, bool 
>> > leadership_transfer)
>> >          return;
>> >      }
>> >
>> > -    if (!raft_set_term(raft, raft->term + 1, &raft->sid)) {
>> > +    if (!is_prevote && !raft_set_term(raft, raft->term + 1, &raft->sid)) {
>> >          return;
>> >      }
>> >
>> > @@ -1809,26 +1821,33 @@ raft_start_election(struct raft *raft, bool 
>> > leadership_transfer)
>> >
>> >      raft->leader_sid = UUID_ZERO;
>> >      raft->role = RAFT_CANDIDATE;
>> > -    /* If there was no leader elected since last election, we know we are
>> > -     * retrying now. */
>> > -    raft->candidate_retrying = !raft->had_leader;
>> > -    raft->had_leader = false;
>> > +    raft->prevote_passed = !is_prevote;
>> > +
>> > +    if (is_prevote || leadership_transfer) {
>> > +        /* If there was no leader elected since last election, we know we 
>> > are
>> > +         * retrying now. */
>> > +        raft->candidate_retrying = !raft->had_leader;
>> > +        raft->had_leader = false;
>> > +
>> > +        raft->election_start = time_msec();
>> > +        raft->election_won = 0;
>> > +    }
>> >
>> >      raft->n_votes = 0;
>> >
>> > -    raft->election_start = time_msec();
>> > -    raft->election_won = 0;
>> >      raft->leadership_transfer = leadership_transfer;
>> >
>> >      static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>> >      if (!VLOG_DROP_INFO(&rl)) {
>> >          long long int now = time_msec();
>> > +        char *comment = is_prevote ? "prevote" : "vote";
>>
>> Should user-visible logs say "pre-vote" instead of "prevote" ?
> 
> I am ok either way. I was thinking about the situation when searching the 
> logs with \<vote\> in vim, but it shouldn't matter much :) I can change it if 
> pre-vote looks better to users.

It was just a bit confusing that all the comments use 'pre-vote',
but the log entry is using 'prevote'.

An alternative might be to use "starting election" for an actual
vote and "starting preliminary election" for a pre-vote.

We may also use 'preliminary': true/false, instead of "is_prevote".
Might be more user-friendly.

'prevote_passed' may be turned into 'preliminary_vote_passed'.
It's on the edge of being too long of a name, not sure what would
be a better one though.

What do you think?

>>
>> >          if (now >= raft->election_timeout) {
>> >              VLOG_INFO("term %"PRIu64": %lld ms timeout expired, "
>> > -                      "starting election",
>> > -                      raft->term, now - raft->election_base);
>> > +                      "starting election (%s)",
>> > +                      raft->term, now - raft->election_base, comment);
>> >          } else {
>> > -            VLOG_INFO("term %"PRIu64": starting election", raft->term);
>> > +            VLOG_INFO("term %"PRIu64": starting election (%s)",
>> > +                      raft->term, comment);
>> >          }
>> >      }
>> >      raft_reset_election_timer(raft);
>> > @@ -1853,6 +1872,7 @@ raft_start_election(struct raft *raft, bool 
>> > leadership_transfer)
>> >                      ? raft->entries[raft->log_end - raft->log_start - 
>> > 1].term
>> >                      : raft->snap.term),
>> >                  .leadership_transfer = leadership_transfer,
>> > +                .is_prevote = is_prevote,
>> >              },
>> >          };
>> >          if (failure_test != FT_DONT_SEND_VOTE_REQUEST) {
>> > @@ -1861,7 +1881,13 @@ raft_start_election(struct raft *raft, bool 
>> > leadership_transfer)
>> >      }
>> >
>> >      /* Vote for ourselves. */
>> > -    raft_accept_vote(raft, me, &raft->sid);
>> > +    if (raft_accept_vote(raft, me, &raft->sid)) {
>> > +        /* We just started vote, so it shouldn't be accepted yet unless 
>> > this is
>> > +         * a one-node cluster. In such case we don't do pre-vote, and 
>> > become
>> > +         * leader immediately. */
>> > +        ovs_assert(!is_prevote);
>> > +        raft_become_leader(raft);
>> > +    }
>> >  }
>> >
>> >  static void
>> > @@ -2041,10 +2067,10 @@ raft_run(struct raft *raft)
>> >                  raft_reset_election_timer(raft);
>> >              } else {
>> >                  raft_become_follower(raft);
>> > -                raft_start_election(raft, false);
>> > +                raft_start_election(raft, true, false);
>> >              }
>> >          } else {
>> > -            raft_start_election(raft, false);
>> > +            raft_start_election(raft, true, false);
>> >          }
>> >
>> >      }
>> > @@ -3673,6 +3699,10 @@ raft_handle_vote_request__(struct raft *raft,
>> >          return false;
>> >      }
>> >
>> > +    if (rq->is_prevote) {
>> > +        return true;
>> > +    }
>> > +
>> >      /* Record a vote for the peer. */
>> >      if (!raft_set_term(raft, raft->term, &rq->common.sid)) {
>> >          return false;
>> > @@ -3685,7 +3715,7 @@ raft_handle_vote_request__(struct raft *raft,
>> >
>> >  static void
>> >  raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
>> > -                     const struct uuid *vote)
>> > +                     const struct uuid *vote, bool is_prevote)
>> >  {
>> >      union raft_rpc rpy = {
>> >          .vote_reply = {
>> > @@ -3695,6 +3725,7 @@ raft_send_vote_reply(struct raft *raft, const struct 
>> > uuid *dst,
>> >              },
>> >              .term = raft->term,
>> >              .vote = *vote,
>> > +            .is_prevote = is_prevote,
>> >          },
>> >      };
>> >      raft_send(raft, &rpy);
>> > @@ -3705,7 +3736,9 @@ raft_handle_vote_request(struct raft *raft,
>> >                           const struct raft_vote_request *rq)
>> >  {
>> >      if (raft_handle_vote_request__(raft, rq)) {
>> > -        raft_send_vote_reply(raft, &rq->common.sid, &raft->vote);
>> > +        raft_send_vote_reply(raft, &rq->common.sid,
>> > +                             rq->is_prevote ? &rq->common.sid : 
>> > &raft->vote,
>> > +                             rq->is_prevote);
>> >      }
>> >  }
>> >
>> > @@ -3723,7 +3756,14 @@ raft_handle_vote_reply(struct raft *raft,
>> >
>> >      struct raft_server *s = raft_find_peer(raft, &rpy->common.sid);
>> >      if (s) {
>> > -        raft_accept_vote(raft, s, &rpy->vote);
>> > +        if (raft_accept_vote(raft, s, &rpy->vote)) {
>>
>> IIUC, the raft_accept_vote() here can't succeed for pre-vote because
>> everyone votes for themselves on pre-vote.  Is that correct?
>>
> pre-vote happens the same way as the real vote, except that pre-vote doesn't 
> increase the term.
> This function is for the candidate to handle a vote-reply from a peer for a 
> vote/pre-vote request sent by itself.
> So, raft_accept_vote should succeed in normal cases, for both pre-vote and 
> vote.
> If it is pre-vote and it is successful, it starts the real vote immediately.
> 
>> What happens if we received a real vote during a pre-vote phase for
>> some reason?  Will this logic fail?  Is this scenario possible?
> 
> The logic of handling the vote/pre-vote reply message only depends on the 
> status of the server itself, regardless of the message's is_prevote field. So 
> we don't really need to care if it is a pre-vote reply or vote reply. The 
> is_prevote field in the reply was added only for validation purpose, not 
> useful for the RPC itself, as mentioned in the comments.

Oh, thanks for explanation, I misread the raft_handle_vote_request()
the first time.  I'll take a closer look.

> 
> Thanks,
> Han
>>
>> > +            if (raft->prevote_passed) {
>> > +                raft_become_leader(raft);
>> > +            } else {
>> > +                /* Start the real election. */
>> > +                raft_start_election(raft, false, false);
>> > +            }
>> > +        }
>> >      }
>> >  }
>> >
>> > @@ -4357,7 +4397,7 @@ raft_handle_become_leader(struct raft *raft,
>> >          VLOG_INFO("received leadership transfer from %s in term %"PRIu64,
>> >                    raft_get_nickname(raft, &rq->common.sid, buf, sizeof 
>> > buf),
>> >                    rq->term);
>> > -        raft_start_election(raft, true);
>> > +        raft_start_election(raft, false, true);
>> >      }
>> >  }
>> >
>> > diff --git a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> 
>> > b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
>> > index 9fbf5dc897f2..8a81136999e0 100644
>> > --- a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
>> > +++ b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
>> > @@ -715,6 +715,48 @@ done
>> >
>> >  AT_CLEANUP
>> >
>> > +
>> > +AT_SETUP([OVSDB cluster - disruptive server])
>> > +AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
>> > +
>> > +n=3
>> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
>> > +ordinal_schema > schema
>> > +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db 
>> > $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
>> > +cid=`ovsdb-tool db-cid s1.db`
>> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
>> > +for i in `seq 2 $n`; do
>> > +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft 
>> > unix:s1.raft])
>> > +done
>> > +
>> > +on_exit 'kill `cat *.pid`'
>> > +for i in `seq $n`; do
>> > +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach 
>> > --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i 
>> > --remote=punix:s$i.ovsdb s$i.db])
>> > +done
>> > +for i in `seq $n`; do
>> > +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
>> > +done
>> > +
>> > +# An unstable follower shouldn't disrupt the healthy cluster - shouldn't
>> > +# trigger term change.
>> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test stop-raft-rpc], 
>> > [0], [ignore])
>> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | 
>> > grep "Role: candidate"])
>> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test clear], [0], 
>> > [ignore])
>> > +
>> > +# Should step back to follower.
>> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | 
>> > grep "Role: follower"])
>> > +
>> > +# No term change.
>> > +for i in `seq $n`; do
>> > +    AT_CHECK([ovs-appctl -t "`pwd`"/s$i cluster/status $schema_name | 
>> > grep "Term: 1"], [0], [ignore])
>> > +done
>> > +
>> > +for i in `seq $n`; do
>> > +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
>> > +done
>> > +
>> > +AT_CLEANUP
>> > +
>> >
>> >  AT_BANNER([OVSDB - cluster tests])
>> >
>>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to