On Fri, Jan 30, 2026 at 5:34 PM Ilya Maximets <[email protected]> wrote:
>
> We recently fixed a couple of issues with actually ignoring a disruptive
> server and with ignoring pre-vote replies during the actual vote:
>
>  c34c21bb0184 ("ovsdb: raft: Actually suppress the disruptive server.")
>  5f12cd410acf ("ovsdb: raft: Discard pre-vote replies during the actual election.")
>
> Without both of these fixes, the following scenario was possible:
>
>   1. A cluster with 3 servers: A (leader), B and C.  Term: X.
>   2. C goes down.
>   3. A and B commit extra data --> C now has an outdated log.
>   4. A transfers leadership and goes down.
>   5. B initiates the election increasing the term to X+1 (no pre-vote).
>   6. B goes down.
>   7. Now the whole cluster is down with database files containing terms
>      X, X+1, and X, respectively.  Log on C is behind.
>   8. All servers go back up.
>   9. C initiates pre-vote on term X.
>  10. A sends pre-vote for C on term X (we do not compare the log yet).
>  11. B sends pre-vote for C on term X+1, because in the absence of
>      commit c34c21bb0184, we send a reply even if the term doesn't
>      match as long as the request is considered disruptive.
>  12. C receives pre-vote for itself from A on term X.
>  13. C now has 2 out of 3 pre-votes (self and A).
>  14. C immediately initiates the actual vote on term X+1.
>  15. C receives pre-vote for itself from B on term X+1.
>  16. In the absence of commit 5f12cd410acf, C treats the pre-vote
>      from B as an actual vote.
>  17. C now thinks that it has 2 out of 3 actual votes and declares
>      itself a leader for term X+1.
>  18. A doesn't send an actual vote, because C has outdated log.
>  19. B sends an actual vote reply voting for itself, because it
>      already voted on term X+1 for itself at step 5.
>  20. C, as a leader, ignores the extra vote from B.
>  21. C sends append requests to A and B with an outdated log.
>  22. A and B acknowledge a new leader.
>  23. A and B attempt to truncate their logs below the commit index.
>  24. A and B crash on assertion failure and can't recover, because
>      the illegal truncation is part of their logs now.
>
> In this situation it may also be possible to have two leaders
> elected at the same time, in case A or B elects itself before
> C sends the first append request, as neither of them voted for
> C and so they can still vote for each other.
>
> Either one of the fixes above breaks the scenario.  With them, B
> wouldn't send a pre-vote on a mismatching term and C wouldn't treat
> it as an actual vote.
>
> Adding a test that reproduces it, for better coverage, as we
> thought that pre-vote replies during the actual vote should not be
> possible.  They still should not be, but only now that we have the
> other fix in place.
>
> Additionally, step 10 can be improved in the future by actually
> comparing the log length on a pre-vote (not in this patch).
>
> The described scenario actually happened in the ovn-kubernetes CI,
> as it used an older OVS 3.4.1 Fedora package that doesn't have the
> aforementioned fixes.
>
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>  tests/ovsdb-cluster.at | 89 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 89 insertions(+)
>
> diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> index ff18ec556..7e3eef8d4 100644
> --- a/tests/ovsdb-cluster.at
> +++ b/tests/ovsdb-cluster.at
> @@ -1023,6 +1023,95 @@ done
>
>  AT_CLEANUP
>
> +AT_SETUP([OVSDB cluster - disruptive server with the old term and outdated log])
> +AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
> +
> +n=3
> +cp $top_srcdir/vswitchd/vswitch.ovsschema schema
> +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster \
> +            --election-timer=2000 s1.db schema unix:s1.raft], [0], [], [stderr])
> +cid=$(ovsdb-tool db-cid s1.db)
> +schema_name=$(ovsdb-tool schema-name schema)
> +for i in $(seq 2 $n); do
> +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
> +done
> +
> +on_exit 'kill $(cat *.pid)'
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir \
> +                --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i \
> +                --remote=punix:s$i.ovsdb s$i.db])
> +done
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +
> +# Create a few transactions.
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait init], [0], [ignore], [ignore])
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=x], [0], [ignore], [ignore])
> +
> +# Stop s3.
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s3], [s3.pid])
> +
> +# Commit more transactions to s1 and s2.
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=y], [0], [ignore], [ignore])
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=z], [0], [ignore], [ignore])
> +
> +# Stop the leader - s1.
> +AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/status $schema_name | grep -q "Role: leader"])
> +AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/status $schema_name | grep -q "Term: 2"])
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s1], [s1.pid])
> +
> +# Wait for s2 to start election and increase the term.
> +OVS_WAIT_UNTIL([ovs-appctl -t $(pwd)/s2 cluster/status $schema_name | grep "Role: candidate"])
> +AT_CHECK([ovs-appctl -t $(pwd)/s2 cluster/status $schema_name | grep -q "Term: 3"])
> +
> +# Stop s2 as well.
> +OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s2], [s2.pid])
> +
> +# Now we have all the servers down with terms 2, 3, 2, and s3 behind
> +# on the log by two transactions.  Let's bring them back up.
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir \
> +                --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i \
> +                --remote=punix:s$i.ovsdb s$i.db])
> +done
> +
> +# Delay elections on s1 and s2, giving s3 time to try a few times.
> +for i in $(seq 1 10); do
> +    AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/failure-test delay-election], [0], [ignore])
> +    AT_CHECK([ovs-appctl -t $(pwd)/s2 cluster/failure-test delay-election], [0], [ignore])
> +    # Election timer is 2 seconds, so delaying every second is enough to
> +    # keep s1 and s2 from starting elections.
> +    sleep 1
> +done
> +
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +
> +# s3 was behind on the log, so it must not be a leader.
> +AT_CHECK([ovs-appctl -t $(pwd)/s3 cluster/status $schema_name | grep -q "Role: follower"])
> +
> +# Add some more data.
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-leader-only --no-wait create QoS type=l], [0], [ignore], [ignore])
> +AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-leader-only --no-wait create QoS type=m], [0], [ignore], [ignore])
> +
> +# Check that all the servers are in the cluster and have all the data.
> +for i in $(seq $n); do
> +    AT_CHECK([ovs-appctl -t $(pwd)/s$i cluster/status $schema_name \
> +                | grep -qE "Role: (leader|follower)"])
> +    AT_CHECK([ovs-vsctl --db=unix:s$i.ovsdb --no-leader-only --bare \
> +                --columns=type list QoS | grep . | sort | tr -d '\n'],
> +             [0], [lmxyz])
> +done
> +
> +for i in $(seq $n); do
> +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
> +done
> +
> +AT_CLEANUP
> +
>
>  AT_BANNER([OVSDB - cluster tests])
>
> --
> 2.52.0

Thanks Ilya for root-causing such a complex scenario and for the
comprehensive commit message!
Acked-by: Han Zhou <[email protected]>
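
For readers following along, the double-counting in steps 13-17 of the
scenario above can be sketched with a toy model (plain Python, purely
illustrative; the names and structures here are invented and are not
taken from ovsdb/raft.c):

```python
# Toy model of the bug described in steps 13-17: a candidate that has
# already moved from pre-vote to the actual vote mistakenly counts a
# late pre-vote reply as a real vote.  Illustrative only, not OVS code.

N_SERVERS = 3
QUORUM = N_SERVERS // 2 + 1  # 2 out of 3

def count_votes(replies, discard_prevotes_during_vote):
    """Count votes for a candidate that already voted for itself."""
    votes = 1  # the candidate's own vote
    for reply in replies:
        if reply["prevote"] and discard_prevotes_during_vote:
            continue  # fixed behavior: ignore stale pre-vote replies
        if reply["granted"]:
            votes += 1
    return votes

# What C receives during the actual vote on term X+1: A withholds its
# vote because C's log is outdated (step 18), while B's pre-vote reply
# arrives after the actual vote has already started (step 15).
replies = [
    {"from": "A", "prevote": False, "granted": False},
    {"from": "B", "prevote": True,  "granted": True},
]

buggy = count_votes(replies, discard_prevotes_during_vote=False)
fixed = count_votes(replies, discard_prevotes_during_vote=True)

print(buggy >= QUORUM)  # True: C wrongly declares itself the leader
print(fixed >= QUORUM)  # False: no quorum, the election safely times out
```

Either fix alone breaks the chain: c34c21bb0184 keeps B from sending the
mismatched-term pre-vote reply in the first place, and 5f12cd410acf
(modeled by the `discard_prevotes_during_vote` flag) makes C ignore it
if it does arrive.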
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev