On Fri, Jan 30, 2026 at 5:34 PM Ilya Maximets <[email protected]> wrote:
>
> We recently fixed a couple of issues with actually ignoring a disruptive
> server as well as ignoring pre-vote replies during the actual vote:
>
>   c34c21bb0184 ("ovsdb: raft: Actually suppress the disruptive server.")
>   5f12cd410acf ("ovsdb: raft: Discard pre-vote replies during the actual election.")
>
> Without both of these fixes, the following scenario was possible:
>
>  1. A cluster with 3 servers: A (leader), B and C.  Term: X.
>  2. C goes down.
>  3. A and B commit extra data --> C now has an outdated log.
>  4. A transfers leadership and goes down.
>  5. B initiates the election, increasing the term to X+1 (no pre-vote).
>  6. B goes down.
>  7. Now the whole cluster is down with database files containing terms
>     X, X+1, and X, respectively.  The log on C is behind.
>  8. All servers go back up.
>  9. C initiates a pre-vote on term X.
> 10. A sends a pre-vote for C on term X (we do not compare the log yet).
> 11. B sends a pre-vote for C on term X+1, because in the absence of
>     commit c34c21bb0184, we send a reply even if the term doesn't
>     match, as long as the request is considered disruptive.
> 12. C receives the pre-vote for itself from A on term X.
> 13. C now has 2 out of 3 pre-votes (self and A).
> 14. C immediately initiates the actual vote on term X+1.
> 15. C receives the pre-vote for itself from B on term X+1.
> 16. In the absence of commit 5f12cd410acf, C treats the pre-vote
>     from B as an actual vote.
> 17. C now thinks that it has 2 out of 3 actual votes and declares
>     itself a leader for term X+1.
> 18. A doesn't send an actual vote, because C has an outdated log.
> 19. B sends an actual vote reply voting for itself, because it
>     already voted for itself on term X+1 at step 5.
> 20. C, as a leader, ignores the extra vote from B.
> 21. C sends append requests to A and B with an outdated log.
> 22. A and B acknowledge the new leader.
> 23. A and B attempt to truncate their logs below the commit index.
> 24. A and B crash on an assertion failure and can't recover, because
>     the illegal truncation is now part of their logs.
>
> In this situation it may also be possible to have two leaders
> elected at the same time, in case A or B elect themselves before
> C sends the first append request, as they never voted for C and
> so can vote for each other.
>
> Either one of the fixes above breaks the scenario.  With them, B
> wouldn't send a pre-vote on a mismatching term and C wouldn't treat
> it as an actual vote.
>
> Adding a test that reproduces it, to have better coverage, as we
> thought that pre-vote replies during the actual vote should not be
> possible.  They still should not be, but only since we got the other
> fix in place.
>
> Additionally, step 10 can also be improved in the future by actually
> comparing the log length on a pre-vote (not in this patch).
>
> The described scenario actually happened in the ovn-kubernetes CI,
> as it used an older OVS 3.4.1 Fedora package that doesn't have the
> aforementioned fixes.
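[As an illustrative aside: the vote miscount in steps 9-17 can be modeled in a few lines. This is NOT ovsdb-server code; the function and flag names are invented purely to show how the two missing checks combine to let C win despite its stale log.]

```python
# Minimal model of steps 9-17 above.  Two toggles stand in for the two
# fixes: suppress_mismatched_term ~ c34c21bb0184, and
# discard_stale_prevotes ~ 5f12cd410acf.  All names are hypothetical.

def c_wins_election(suppress_mismatched_term, discard_stale_prevotes):
    """Return True if C reaches a majority (2 of 3) and becomes leader."""
    peer_terms = {"A": 2, "B": 3}   # Terms after the outage (X=2, X+1=3).
    c_term = 2                      # C restarts on the old term X.
    votes = {"C"}                   # C always votes for itself.

    # Step 9: C sends pre-vote requests on term X; peers reply.
    prevote_replies = []
    for name, term in peer_terms.items():
        if term == c_term or not suppress_mismatched_term:
            prevote_replies.append((name, term))  # Bug 1: B replies on X+1.

    # Steps 12-14: A's matching reply gives C 2 of 3 pre-votes, so C
    # starts the actual election and bumps its term to X+1.
    c_term = 3
    in_actual_vote = True

    # Steps 15-16: B's late pre-vote reply arrives, now on a matching term.
    for name, term in prevote_replies:
        if name == "A":
            continue  # A's reply was already consumed during the pre-vote.
        if term == c_term and in_actual_vote and not discard_stale_prevotes:
            votes.add(name)  # Bug 2: pre-vote reply counted as a real vote.

    # Steps 18-19: neither A nor B casts a real vote for C, so any
    # majority C sees at this point can only come from Bug 2.
    return len(votes) >= 2

print(c_wins_election(False, False))  # both fixes missing: C becomes leader
print(c_wins_election(True, False))   # c34c21bb0184 alone breaks the chain
print(c_wins_election(False, True))   # 5f12cd410acf alone breaks the chain
```

This matches the commit message's claim that either fix alone is sufficient: with both toggles off C collects a bogus majority, while enabling either one leaves C short of 2 votes.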
>
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>  tests/ovsdb-cluster.at | 89 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 89 insertions(+)
>
> diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> index ff18ec556..7e3eef8d4 100644
> --- a/tests/ovsdb-cluster.at
> +++ b/tests/ovsdb-cluster.at
> @@ -1023,6 +1023,95 @@ done
>
>  AT_CLEANUP
>
> +AT_SETUP([OVSDB cluster - disruptive server with the old term and outdated log])
> +AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
> +
> +n=3
> +cp $top_srcdir/vswitchd/vswitch.ovsschema schema
> +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster \
> +            --election-timer=2000 s1.db schema unix:s1.raft], [0], [], [stderr])
> +cid=$(ovsdb-tool db-cid s1.db)
> +schema_name=$(ovsdb-tool schema-name schema)
> +for i in $(seq 2 $n); do
> +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
> +done
> +
> +on_exit 'kill $(cat *.pid)'
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir \
> +                --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i \
> +                --remote=punix:s$i.ovsdb s$i.db])
> +done
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +
> +# Create a few transactions.
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait init], [0], [ignore], [ignore])
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=x], [0], [ignore], [ignore])
> +
> +# Stop s3.
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s3], [s3.pid])
> +
> +# Commit more transactions to s1 and s2.
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=y], [0], [ignore], [ignore])
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=z], [0], [ignore], [ignore])
> +
> +# Stop the leader - s1.
> +AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/status $schema_name | grep -q "Role: leader"])
> +AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/status $schema_name | grep -q "Term: 2"])
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s1], [s1.pid])
> +
> +# Wait for s2 to start an election and increase the term.
> +OVS_WAIT_UNTIL([ovs-appctl -t $(pwd)/s2 cluster/status $schema_name | grep "Role: candidate"])
> +AT_CHECK([ovs-appctl -t $(pwd)/s2 cluster/status $schema_name | grep -q "Term: 3"])
> +
> +# Stop s2 as well.
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s2], [s2.pid])
> +
> +# Now we have all the servers down with terms 2, 3, 2, and s3 behind on
> +# the log by two transactions.  Let's bring them back up.
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir \
> +                --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i \
> +                --remote=punix:s$i.ovsdb s$i.db])
> +done
> +
> +# Delay elections on s1 and s2, giving s3 time to try a few times.
> +for i in $(seq 1 10); do
> +    AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/failure-test delay-election], [0], [ignore])
> +    AT_CHECK([ovs-appctl -t $(pwd)/s2 cluster/failure-test delay-election], [0], [ignore])
> +    # The election timer is 2 seconds, so delaying every second is enough
> +    # to keep s1 and s2 from starting elections.
> +    sleep 1
> +done
> +
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +
> +# s3 was behind on the log, so it must not be a leader.
> +AT_CHECK([ovs-appctl -t $(pwd)/s3 cluster/status $schema_name | grep -q "Role: follower"])
> +
> +# Add some more data.
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-leader-only --no-wait create QoS type=l], [0], [ignore], [ignore])
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-leader-only --no-wait create QoS type=m], [0], [ignore], [ignore])
> +
> +# Check that all the servers are in the cluster and have all the data.
> +for i in $(seq $n); do
> +    AT_CHECK([ovs-appctl -t $(pwd)/s$i cluster/status $schema_name \
> +                | grep -qE "Role: (leader|follower)"])
> +    AT_CHECK([ovs-vsctl --db=unix:s$i.ovsdb --no-leader-only --bare \
> +                --columns=type list QoS | grep . | sort | tr -d '\n'],
> +             [0], [lmxyz])
> +done
> +
> +for i in $(seq $n); do
> +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
> +done
> +
> +AT_CLEANUP
> +
>
> AT_BANNER([OVSDB - cluster tests])
>
> --
> 2.52.0
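[Another illustrative aside: the crash in steps 23-24 of the quoted scenario corresponds to a basic Raft safety invariant. The sketch below is a toy version with invented names, not the OVS implementation, which enforces this with an assertion in its raft code.]

```python
# Toy version of the invariant violated in steps 23-24: a Raft server
# must never truncate log entries it has already committed.  Names are
# hypothetical; this is not ovsdb-server code.

def truncation_is_legal(new_log_end, commit_index):
    """A server may only drop log entries that are not yet committed."""
    return new_log_end >= commit_index

# The stale leader C asks A and B to rewind below their commit index,
# which is exactly the illegal case:
print(truncation_is_legal(new_log_end=5, commit_index=7))  # illegal
print(truncation_is_legal(new_log_end=9, commit_index=7))  # allowed
```

The test added by this patch exercises the cluster in a way that, on a build without the two fixes, drives the servers into the illegal case above.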
Thanks Ilya for root-causing such a complex scenario and for the
comprehensive commit message!

Acked-by: Han Zhou <[email protected]>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
