This is an automated email from the ASF dual-hosted git repository.

alexey pushed a commit to branch branch-1.18.x
in repository https://gitbox.apache.org/repos/asf/kudu.git


The following commit(s) were added to refs/heads/branch-1.18.x by this push:
     new 0c47a46e4 KUDU-3641 fix flaky TestNewLeaderCantResolvePeers
0c47a46e4 is described below

commit 0c47a46e41235020337984a6053d3b7e3964092b
Author: Alexey Serbin <[email protected]>
AuthorDate: Fri Jan 24 16:00:22 2025 -0800

    KUDU-3641 fix flaky TestNewLeaderCantResolvePeers
    
    I noticed that the RaftConsensusElectionITest.TestNewLeaderCantResolvePeers
    scenario was failing from time to time in pre-commit tests, and the same
    issue was also exposed by the flaky tests dashboard [1].
    
    The scenario would usually succeed because in most cases the system
    catalog was able to establish a tablet replica at the newly added tablet
    server even before LeaderStepDown() had been called.  Since the UUIDs
    of the new and the old leader were the same for the LeaderStepDown()
    invocation, the implementation was using the short-circuited path
    (i.e. doing nothing) instead of starting an actual election round.
    The scenario would fail if the tablet replica had not yet been placed
    at the newly added server by the time its presence was checked via
    ListRunningTabletIds().
    
    The fix is trivial: use StartElection() instead of LeaderStepDown().
    
    To verify that this patch fixes the issue, I ran the following command
    against DEBUG bits built with and without the patch on the same machine.
    Without the patch, the scenario would fail about once in 150 runs.
    With the patch, there wasn't a single failure.
    
      ./bin/raft_consensus_election-itest \
        --gtest_filter='*TestNewLeaderCantResolvePeers' \
        --stress_cpu_threads=24 \
        --gtest_repeat=1000
    
    This is a follow-up to f9647149a49ddb87ea0ecf069eab3b5ec0217136.
    
    [1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=raft_consensus_election-itest
    
    Change-Id: I9f724fee15eec74c068ce0aecfd4544f99a46866
    Reviewed-on: http://gerrit.cloudera.org:8080/22389
    Tested-by: Kudu Jenkins
    Reviewed-by: Yifan Zhang <[email protected]>
    (cherry picked from commit 6c77ec8752dce6c8253c980c71a25859a3b63f67)
    Reviewed-on: http://gerrit.cloudera.org:8080/22390
    Tested-by: Alexey Serbin <[email protected]>
---
 src/kudu/integration-tests/raft_consensus_election-itest.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/kudu/integration-tests/raft_consensus_election-itest.cc b/src/kudu/integration-tests/raft_consensus_election-itest.cc
index ce86a63ec..06478a0b3 100644
--- a/src/kudu/integration-tests/raft_consensus_election-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus_election-itest.cc
@@ -272,9 +272,9 @@ TEST_F(RaftConsensusElectionITest, TestNewLeaderCantResolvePeers) {
   }
   // Cause an election again to trigger a new report to the master. This time
   // the master should place the replica since it has a new tserver available.
-  ASSERT_OK(LeaderStepDown(
-      second_ts, tablet_id, kTimeout, /*error=*/nullptr, second_ts->uuid()));
+  ASSERT_OK(StartElection(second_ts, tablet_id, kTimeout));
   ASSERT_OK(WaitUntilLeader(second_ts, tablet_id, kTimeout));
+  NO_FATALS(cluster_->AssertNoCrashes());
 
   STLDeleteValues(&tablet_servers_);
   ASSERT_OK(itest::CreateTabletServerMap(cluster_->master_proxy(),
