We recently fixed a couple of issues with actually ignoring a disruptive
server and with ignoring pre-vote replies during the actual vote:
c34c21bb0184 ("ovsdb: raft: Actually suppress the disruptive server.")
5f12cd410acf ("ovsdb: raft: Discard pre-vote replies during the actual
election.")
Without both of these fixes, the following scenario was possible:
1. A cluster with 3 servers: A (leader), B and C. Term: X.
2. C goes down.
3. A and B commit extra data --> C now has an outdated log.
4. A transfers leadership and goes down.
5. B initiates the election, increasing the term to X+1 (no pre-vote).
6. B goes down.
7. Now the whole cluster is down with database files containing terms
X, X+1, and X, respectively. The log on C is behind.
8. All servers go back up.
9. C initiates pre-vote on term X.
10. A sends pre-vote for C on term X (we do not compare the log yet).
11. B sends pre-vote for C on term X+1, because in the absence of
commit c34c21bb0184, we send a reply even if the term doesn't
match as long as the request is considered disruptive.
12. C receives pre-vote for itself from A on term X.
13. C now has 2 out of 3 pre-votes (self and A).
14. C immediately initiates the actual vote on term X+1.
15. C receives pre-vote for itself from B on term X+1.
16. In the absence of commit 5f12cd410acf, C treats the pre-vote
from B as an actual vote.
17. C now thinks that it has 2 out of 3 actual votes and declares
itself a leader for term X+1.
18. A doesn't send an actual vote, because C has an outdated log.
19. B sends an actual vote reply voting for itself, because it
already voted for itself on term X+1 at step 5.
20. C, as a leader, ignores the extra vote from B.
21. C sends append requests to A and B with an outdated log.
22. A and B acknowledge a new leader.
23. A and B attempt to truncate their logs below the commit index.
24. A and B crash on assertion failure and can't recover, because
the illegal truncation is part of their logs now.
In this situation it may also be possible to have two leaders
elected at the same time, if A or B elects itself before
C sends the first append request, since they never voted for C and
so can vote for each other.
Either one of the fixes above breaks the scenario. With them, B
wouldn't send a pre-vote on a mismatching term and C wouldn't treat
it as an actual vote.
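For illustration, here is a minimal sketch in C of the two guards the
fixes introduce. The types, fields, and function names are invented for
the sake of the example and do not match the actual ovsdb/raft.c code:

    /* Simplified illustration only; names do not match ovsdb/raft.c. */
    #include <stdbool.h>
    #include <stdint.h>

    enum raft_role { RAFT_FOLLOWER, RAFT_CANDIDATE, RAFT_LEADER };

    struct raft {
        enum raft_role role;
        uint64_t term;       /* Our current term. */
        bool prevote_phase;  /* True while still gathering pre-votes,
                              * false once the actual election started. */
    };

    /* Guard 1 (c34c21bb0184): a vote request whose term doesn't match
     * ours gets no reply at all, even if the sender is considered
     * disruptive (cf. step 11 above). */
    static bool
    should_reply_to_vote_request(const struct raft *raft, uint64_t req_term)
    {
        return req_term == raft->term;
    }

    /* Guard 2 (5f12cd410acf): a vote reply only counts if its pre-vote
     * flag matches the phase we are in, so a stale pre-vote reply that
     * arrives during the actual election is discarded
     * (cf. steps 15-17 above). */
    static bool
    should_count_vote_reply(const struct raft *raft, bool reply_is_prevote)
    {
        return raft->role == RAFT_CANDIDATE
               && reply_is_prevote == raft->prevote_phase;
    }

With both checks in place, step 11 produces no reply at all and the stray
pre-vote reply in step 15 is simply dropped, so C never collects the votes
it needs to declare itself a leader with an outdated log.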
Adding a test that reproduces it, for better coverage, as we
thought that pre-vote replies during the actual vote should not be
possible. They still should not be, but only now that the other
fix is in place.
Additionally, step 10 can be improved in the future by actually
comparing the logs on a pre-vote (not in this patch).
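A hypothetical sketch of what that comparison could look like, using the
standard Raft log-recency rule; the names are invented and this is not
part of this patch:

    #include <stdbool.h>
    #include <stdint.h>

    struct log_position {
        uint64_t last_term;   /* Term of the last log entry. */
        uint64_t last_index;  /* Index of the last log entry. */
    };

    /* Grant a pre-vote only if the candidate's log is at least as
     * up-to-date as ours. */
    static bool
    candidate_log_is_up_to_date(const struct log_position *ours,
                                const struct log_position *theirs)
    {
        return theirs->last_term > ours->last_term
               || (theirs->last_term == ours->last_term
                   && theirs->last_index >= ours->last_index);
    }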
The described scenario actually happened in the ovn-kubernetes CI,
as it used an older OVS 3.4.1 Fedora package that doesn't have the
aforementioned fixes.
Signed-off-by: Ilya Maximets <[email protected]>
---
tests/ovsdb-cluster.at | 89 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index ff18ec556..7e3eef8d4 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -1023,6 +1023,95 @@ done
AT_CLEANUP
+AT_SETUP([OVSDB cluster - disruptive server with the old term and outdated log])
+AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
+
+n=3
+cp $top_srcdir/vswitchd/vswitch.ovsschema schema
+AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster \
+ --election-timer=2000 s1.db schema unix:s1.raft], [0], [], [stderr])
+cid=$(ovsdb-tool db-cid s1.db)
+schema_name=$(ovsdb-tool schema-name schema)
+for i in $(seq 2 $n); do
+    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
+done
+
+on_exit 'kill $(cat *.pid)'
+for i in $(seq $n); do
+ AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir \
+ --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i \
+ --remote=punix:s$i.ovsdb s$i.db])
+done
+for i in $(seq $n); do
+ AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+done
+
+# Create a few transactions.
+AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait init], [0], [ignore], [ignore])
+AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=x], [0], [ignore], [ignore])
+
+# Stop the s3.
+OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s3], [s3.pid])
+
+# Commit more transactions to s1 and s2.
+AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=y], [0], [ignore], [ignore])
+AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-wait create QoS type=z], [0], [ignore], [ignore])
+
+# Stop the leader - s1.
+AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/status $schema_name | grep -q "Role: leader"])
+AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/status $schema_name | grep -q "Term: 2"])
+OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s1], [s1.pid])
+
+# Wait for s2 to start election and increase the term.
+OVS_WAIT_UNTIL([ovs-appctl -t $(pwd)/s2 cluster/status $schema_name | grep "Role: candidate"])
+AT_CHECK([ovs-appctl -t $(pwd)/s2 cluster/status $schema_name | grep -q "Term: 3"])
+
+# Stop the s2 as well.
+OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s2], [s2.pid])
+
+# Now we have all the servers down with terms 2, 3, 2, and the s3 behind on the
+# log by two transactions. Let's bring them back up.
+for i in $(seq $n); do
+ AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir \
+ --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i \
+ --remote=punix:s$i.ovsdb s$i.db])
+done
+
+# Delay elections on s1 and s2 giving s3 time to try a few times.
+for i in $(seq 1 10); do
+    AT_CHECK([ovs-appctl -t $(pwd)/s1 cluster/failure-test delay-election], [0], [ignore])
+    AT_CHECK([ovs-appctl -t $(pwd)/s2 cluster/failure-test delay-election], [0], [ignore])
+ # Election timer is 2 seconds, so delaying every second is enough to keep
+ # s1 and s2 from starting elections.
+ sleep 1
+done
+
+for i in $(seq $n); do
+ AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+done
+
+# s3 was behind on the log, it must not be a leader.
+AT_CHECK([ovs-appctl -t $(pwd)/s3 cluster/status $schema_name | grep -q "Role: follower"])
+
+# Add some more data.
+AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-leader-only --no-wait create QoS type=l], [0], [ignore], [ignore])
+AT_CHECK([ovs-vsctl --db=unix:s1.ovsdb --no-leader-only --no-wait create QoS type=m], [0], [ignore], [ignore])
+
+# Check that all the servers are in the cluster and have all the data.
+for i in $(seq $n); do
+ AT_CHECK([ovs-appctl -t $(pwd)/s$i cluster/status $schema_name \
+ | grep -qE "Role: (leader|follower)"])
+ AT_CHECK([ovs-vsctl --db=unix:s$i.ovsdb --no-leader-only --bare \
+ --columns=type list QoS | grep . | sort | tr -d '\n'],
+ [0], [lmxyz])
+done
+
+for i in $(seq $n); do
+ OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
+done
+
+AT_CLEANUP
+
AT_BANNER([OVSDB - cluster tests])
--
2.52.0