This is an automated email from the ASF dual-hosted git repository.
alexey pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git
The following commit(s) were added to refs/heads/master by this push:
new 6c77ec875 KUDU-3641 fix flaky TestNewLeaderCantResolvePeers
6c77ec875 is described below
commit 6c77ec8752dce6c8253c980c71a25859a3b63f67
Author: Alexey Serbin <[email protected]>
AuthorDate: Fri Jan 24 16:00:22 2025 -0800
KUDU-3641 fix flaky TestNewLeaderCantResolvePeers
I noticed that RaftConsensusElectionITest.TestNewLeaderCantResolvePeers
scenario was failing from time to time in pre-commit tests, and the same
issue was also exposed by the flaky tests dashboard [1].
The scenario would usually succeed because in most cases the system
catalog was able to establish a tablet replica at the newly added tablet
server even before LeaderStepDown() had been called. Since the UUIDs
of the new and the old leader were the same for the LeaderStepDown()
invocation, the implementation was using the short-circuited path
(i.e. doing nothing) instead of starting an actual election round.
The scenario would fail if the tablet replica hadn't yet been placed
at the newly added server by the time the test checked for its presence
with ListRunningTabletIds().
The fix is trivial: use StartElection() instead of LeaderStepDown().
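
To illustrate the difference, here is a minimal toy model of the two
calls. It is a sketch only, not Kudu code: ToyReplica, ToyLeaderStepDown()
and ToyStartElection() are hypothetical names made up for this example,
and the model assumes a single voter that always wins its own election.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-in for a Raft replica; this is NOT Kudu's consensus class.
struct ToyReplica {
  std::string uuid;
  bool is_leader = false;
  int64_t term = 0;
  int elections_started = 0;
};

// Hypothetical model of a leadership transfer request. If the requested
// successor is the current leader itself, the call short-circuits and
// returns false: no election round is started and the term is unchanged.
bool ToyLeaderStepDown(ToyReplica* r, const std::string& new_leader_uuid) {
  assert(r != nullptr);
  if (r->is_leader && new_leader_uuid == r->uuid) {
    return false;  // Short-circuited path: nothing to do.
  }
  r->is_leader = false;  // A real system would hand off to the successor.
  return true;
}

// Hypothetical model of an explicit election request. It always starts a
// new round, which bumps the term. Assumption: the replica is the only
// voter in this toy model, so it wins immediately.
void ToyStartElection(ToyReplica* r) {
  assert(r != nullptr);
  ++r->term;
  ++r->elections_started;
  r->is_leader = true;
}
```

In this model, stepping down in favor of oneself leaves term and
elections_started untouched, whereas ToyStartElection() always produces
a new term, which is the event the test relies on.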
To verify that this patch fixes the issue, I ran the following command
against DEBUG bits built with and without the patch on the same machine.
Without the patch, the scenario would fail once in ~150 runs.
With the patch, there hasn't been a single failure.
./bin/raft_consensus_election-itest \
--gtest_filter='*TestNewLeaderCantResolvePeers' \
--stress_cpu_threads=24 \
--gtest_repeat=1000
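
As a rough sanity check on the numbers above: if the failure rate were
still about 1 in 150 and the runs were independent (an assumption, since
repeated runs on one machine may be correlated), the chance of seeing no
failures in 1000 runs would be (1 - 1/150)^1000, which is on the order of
0.1%. A minimal sketch of that arithmetic, with ProbNoFailures() being a
hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Probability of observing zero failures in `runs` independent runs when
// each run fails with probability `fail_rate`.
// Assumption: runs are independent and the failure rate is constant.
double ProbNoFailures(double fail_rate, int runs) {
  assert(fail_rate >= 0.0 && fail_rate <= 1.0);
  assert(runs >= 0);
  return std::pow(1.0 - fail_rate, runs);
}
```

Calling ProbNoFailures(1.0 / 150.0, 1000) gives that estimate, so a clean
1000-run pass is unlikely to be luck under these assumptions.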
This is a follow-up to f9647149a49ddb87ea0ecf069eab3b5ec0217136.
[1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=raft_consensus_election-itest
Change-Id: I9f724fee15eec74c068ce0aecfd4544f99a46866
Reviewed-on: http://gerrit.cloudera.org:8080/22389
Tested-by: Kudu Jenkins
Reviewed-by: Yifan Zhang <[email protected]>
---
src/kudu/integration-tests/raft_consensus_election-itest.cc | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/kudu/integration-tests/raft_consensus_election-itest.cc b/src/kudu/integration-tests/raft_consensus_election-itest.cc
index ce86a63ec..06478a0b3 100644
--- a/src/kudu/integration-tests/raft_consensus_election-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus_election-itest.cc
@@ -272,9 +272,9 @@ TEST_F(RaftConsensusElectionITest, TestNewLeaderCantResolvePeers) {
}
// Cause an election again to trigger a new report to the master. This time
// the master should place the replica since it has a new tserver available.
- ASSERT_OK(LeaderStepDown(
- second_ts, tablet_id, kTimeout, /*error=*/nullptr, second_ts->uuid()));
+ ASSERT_OK(StartElection(second_ts, tablet_id, kTimeout));
ASSERT_OK(WaitUntilLeader(second_ts, tablet_id, kTimeout));
+ NO_FATALS(cluster_->AssertNoCrashes());
STLDeleteValues(&tablet_servers_);
ASSERT_OK(itest::CreateTabletServerMap(cluster_->master_proxy(),