[3/4] kudu git commit: [raft_consensus-itest] fix flake in Test_KUDU_1735

mpercy Mon, 26 Nov 2018 13:32:01 -0800

[raft_consensus-itest] fix flake in Test_KUDU_1735

Fixed flake in RaftConsensusParamReplicationModesITest.Test_KUDU_1735
scenario of the raft_consensus-itest suite, for both the 3-4-3 and 3-2-3
replica management schemes.


The flakiness was due to a few reasons, where the most prominent one was
the race in 3-4-3 mode between shutting down replica's consensus
upon DeleteTablet(TABLET_DATA_TOMBSTONED) and re-discovering the peer
by the leader once the catalog manager added the replica back as a
non-voter.  Another minor flake was due to not waiting for the manually
elected leader to commit an operation on its own term.  And the one
was due to waiting for RECEIVED_OPID instead of COMMITTED_OPID Raft
operation index before setting the --fault_crash_before_append_commit
flag to 1.

I ran the RaftConsensusParamReplicationModesITest.Test_KUDU_1735
scenario using dist_test with --stress_cpu_threads=16 before and after
the fix with the following results:

  before the fix (950 out of 1024 failed):
    http://dist-test.cloudera.org/job?job_id=aserbin.1542868617.80411

  after the fix  (  0 out of 1024 failed):
    http://dist-test.cloudera.org/job?job_id=aserbin.1542871109.113847

Change-Id: If44ad0e8363a3aead409484cff68843f1e5d6b6d
Reviewed-on: http://gerrit.cloudera.org:8080/11981
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/b2097e4b
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/b2097e4b
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/b2097e4b

Branch: refs/heads/master
Commit: b2097e4b9e2eb3cfca6435b1ec05971215494d36
Parents: f65211b
Author: Alexey Serbin <[email protected]>
Authored: Wed Nov 21 21:48:07 2018 -0800
Committer: Alexey Serbin <[email protected]>
Committed: Mon Nov 26 20:29:01 2018 +0000

----------------------------------------------------------------------
 .../integration-tests/cluster_itest_util.cc     |  16 +--
 src/kudu/integration-tests/cluster_itest_util.h |  16 ++-
 .../integration-tests/raft_consensus-itest.cc   | 109 ++++++++++++-------
 3 files changed, 88 insertions(+), 53 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/b2097e4b/src/kudu/integration-tests/cluster_itest_util.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/cluster_itest_util.cc 
b/src/kudu/integration-tests/cluster_itest_util.cc
index 186cf01..9dae097 100644
--- a/src/kudu/integration-tests/cluster_itest_util.cc
+++ b/src/kudu/integration-tests/cluster_itest_util.cc
@@ -212,15 +212,16 @@ Status WaitForOpFromCurrentTerm(TServerDetails* replica,
 Status WaitForServersToAgree(const MonoDelta& timeout,
                              const TabletServerMap& tablet_servers,
                              const string& tablet_id,
-                             int64_t minimum_index) {
+                             int64_t minimum_index,
+                             consensus::OpIdType op_id_type) {
   const MonoTime deadline = MonoTime::Now() + timeout;
 
+  vector<TServerDetails*> servers;
+  AppendValuesFromMap(tablet_servers, &servers);
   for (int i = 1; MonoTime::Now() < deadline; i++) {
-    vector<TServerDetails*> servers;
-    AppendValuesFromMap(tablet_servers, &servers);
     vector<OpId> ids;
-    Status s = GetLastOpIdForEachReplica(tablet_id, servers, 
consensus::RECEIVED_OPID, timeout,
-                                         &ids);
+    Status s = GetLastOpIdForEachReplica(
+        tablet_id, servers, op_id_type, timeout, &ids);
     if (s.ok()) {
       bool any_behind = false;
       bool any_disagree = false;
@@ -248,8 +249,9 @@ Status WaitForServersToAgree(const MonoDelta& timeout,
     LOG(INFO) << "Not converged past " << minimum_index << " yet: " << ids;
     SleepFor(MonoDelta::FromMilliseconds(min(i * 100, 1000)));
   }
-  return Status::TimedOut(Substitute("Index $0 not available on all replicas 
after $1. ",
-                                              minimum_index, 
timeout.ToString()));
+  return Status::TimedOut(
+      Substitute("index $0 not available on all replicas after $1",
+                 minimum_index, timeout.ToString()));
 }
 
 // Wait until all specified replicas have logged the given index.

http://git-wip-us.apache.org/repos/asf/kudu/blob/b2097e4b/src/kudu/integration-tests/cluster_itest_util.h
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/cluster_itest_util.h 
b/src/kudu/integration-tests/cluster_itest_util.h
index 1768880..f9a073e 100644
--- a/src/kudu/integration-tests/cluster_itest_util.h
+++ b/src/kudu/integration-tests/cluster_itest_util.h
@@ -125,15 +125,19 @@ Status WaitForOpFromCurrentTerm(TServerDetails* replica,
                                 const MonoDelta& timeout,
                                 consensus::OpId* opid = nullptr);
 
-// Wait until all of the servers have converged on the same log index.
-// The converged index must be at least equal to 'minimum_index'.
+// Wait until all of the servers have converged on the same log index. The
+// converged index must be at least equal to 'minimum_index'. The 'op_id_type'
+// parameter is specify which kind of operations to watch for while tracking
+// their indexes.
 //
 // Requires that all servers are running. Returns Status::TimedOut if the
 // indexes do not converge within the given timeout.
-Status WaitForServersToAgree(const MonoDelta& timeout,
-                             const TabletServerMap& tablet_servers,
-                             const std::string& tablet_id,
-                             int64_t minimum_index);
+Status WaitForServersToAgree(
+    const MonoDelta& timeout,
+    const TabletServerMap& tablet_servers,
+    const std::string& tablet_id,
+    int64_t minimum_index,
+    consensus::OpIdType op_id_type = consensus::RECEIVED_OPID);
 
 // Wait until all specified replicas have logged at least the given index.
 // Unlike WaitForServersToAgree(), the servers do not actually have to converge

http://git-wip-us.apache.org/repos/asf/kudu/blob/b2097e4b/src/kudu/integration-tests/raft_consensus-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/raft_consensus-itest.cc 
b/src/kudu/integration-tests/raft_consensus-itest.cc
index 29adabc..6593f2b 100644
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -2445,8 +2445,8 @@ INSTANTIATE_TEST_CASE_P(, 
RaftConsensusParamReplicationModesITest,
                         ::testing::Bool());
 
 // Regression test for KUDU-1735, a crash in the case where a pending
-// config change operation is aborted during tablet deletion when that config 
change
-// was in fact already persisted to disk.
+// config change operation is aborted during tablet deletion when that config
+// change was in fact already persisted to disk.
 TEST_P(RaftConsensusParamReplicationModesITest, Test_KUDU_1735) {
   const MonoDelta kTimeout = MonoDelta::FromSeconds(10);
   const bool is_3_4_3 = GetParam();
@@ -2465,6 +2465,7 @@ TEST_P(RaftConsensusParamReplicationModesITest, 
Test_KUDU_1735) {
 
   NO_FATALS(BuildAndStart(kTsFlags, kMasterFlags));
 
+  ASSERT_EQ(3, tablet_servers_.size());
   TServerDetails* leader_tserver =
       tablet_servers_[cluster_->tablet_server(0)->uuid()];
   TServerDetails* evicted_tserver =
@@ -2473,9 +2474,18 @@ TEST_P(RaftConsensusParamReplicationModesITest, 
Test_KUDU_1735) {
   ASSERT_OK(StartElection(leader_tserver, tablet_id_, kTimeout));
   ASSERT_OK(WaitUntilLeader(leader_tserver, tablet_id_, kTimeout));
 
-  // Wait for log index 1 to propagate to all involved tablet servers.
-  ASSERT_OK(WaitForServersToAgree(MonoDelta::FromSeconds(10), tablet_servers_,
-                                  tablet_id_, 1));
+  // Wait for at least one operation to be committed in current term by the
+  // leader replica. Attempting to perform a config change (RemoveServer()
+  // in this scenario) prior to committing at least one operation in current
+  // term leads to an error since Kudu's Raft implementation enforces that
+  // invariant.
+  consensus::OpId opid;
+  ASSERT_OK(WaitForOpFromCurrentTerm(
+      leader_tserver, tablet_id_, consensus::COMMITTED_OPID, kTimeout, &opid));
+  // Wait for the committed index to propagate to all involved tablet servers.
+  ASSERT_OK(WaitForServersToAgree(
+      kTimeout, tablet_servers_, tablet_id_, opid.index(),
+      consensus::COMMITTED_OPID));
 
   // Make follower tablet servers crash before writing a commit message.
   for (const auto& e : tablet_servers_) {
@@ -2487,9 +2497,9 @@ TEST_P(RaftConsensusParamReplicationModesITest, 
Test_KUDU_1735) {
     ASSERT_OK(cluster_->SetFlag(ts, "fault_crash_before_append_commit", 
"1.0"));
   }
 
-  // Run a config change. This will cause the other servers to crash with 
pending config
-  // change operations due to the above fault injection.
-  auto affected_servers = 0;
+  // Run a config change. This will cause the other servers to crash with
+  // pending config change operations due to the above fault injection.
+  auto num_crashed_servers = 0;
   ASSERT_OK(RemoveServer(leader_tserver, tablet_id_, evicted_tserver, 
kTimeout));
   for (const auto& e : tablet_servers_) {
     const auto& server_uuid = e.first;
@@ -2497,45 +2507,64 @@ TEST_P(RaftConsensusParamReplicationModesITest, 
Test_KUDU_1735) {
       // The tablet server with the leader replica will not crash.
       continue;
     }
-    if (server_uuid == evicted_tserver->uuid() && is_3_4_3) {
-      // The removed server with follower replica will not receive any
-      // eviction-related configuration requests since it's not a part of the
-      // resulting Raft configuration. However, the server will receive config
-      // change request when a new voter replica is added to replace the 
removed
-      // one, and the behavior is different between 3-2-3 and 3-4-3 schemes.
-      //
-      // In case of the 3-2-3 scheme, the gone-and-back server will crash while
-      // trying to add the COMMIT entry into the WAL.  That change corresponds
-      // to the transaction which adds the server as a voting member of the
-      // resulting Raft configuration. Two out of three voting replicas are
-      // present, so that transaction is committed by the leader.
-      //
-      // In case of the 3-4-3 scheme, the replica is added back as a non-voter.
-      // Only one (out of two required) voting replica is available:
-      // the other replica crashes on an attempt to add a log entry about
-      // committing the configuration change on this server's eviction.
-      // So, since the configuration change on adding the server back as
-      // a non-voter replica cannot be committed, there isn't an attempt to add
-      // a commit entry into this server's WAL, so it should not crash.
+    // One of the remaining followers will crash while trying to commit the
+    // config change corresponding to the eviction of the other follower.
+    //
+    // As for the other (gone-and-back) follower, the behavior depends on
+    // the replica management scheme, being a bit more subtle in the case of
+    // the 3-4-3 scheme.
+    //
+    // In case of the 3-2-3 scheme, the gone-and-back server will crash while
+    // trying to add the COMMIT entry into the WAL. That change corresponds
+    // to the transaction which adds the server as a voting member of the
+    // resulting Raft configuration.
+    //
+    // In the 3-4-3 case, the catalog manager sends DeleteTablet() and ADD_PEER
+    // Raft configuration change requests upon detecting a change in the
+    // tablet's configuration reported by leader replica (the latter request is
+    // to keep the target replication factor). Those requests are scheduled
+    // to be sent asynchronously around the same time. Due to the concurrency 
of
+    // sending and processing those two requests, there are two possible
+    // outcomes:
+    //   1. If the tablet server completes processing the DeleteTablet() 
request
+    //      before receiving the ADD_PEER config change request, the ADD_PEER
+    //      config change to add a new non-voter will remain pending since
+    //      only one voter (the leader replica of the tablet) is alive. Since
+    //      the configuration change cannot be committed, no commit message is
+    //      written to the WAL in this case. So, the tablet server will not hit
+    //      the injected crash.
+    //   2. If the DeleteTablet() request is delayed and the leader replica
+    //      re-discovers the evicted replica as a new peer with LMP_MISMATCH
+    //      status after processing the ADD_PEER configuration change,
+    //      the leader replica will send the missing updates to the lingering
+    //      one. The lingering replica will try to replay the first missing
+    //      update (REMOVE_PEER) and will crash upon adding corresponding
+    //      record into its WAL.
+    auto* ts = cluster_->tablet_server_by_uuid(server_uuid);
+    auto s = ts->WaitForInjectedCrash(MonoDelta::FromSeconds(5));
+    if (server_uuid == evicted_tserver->uuid() && is_3_4_3 && !s.ok()) {
+      ASSERT_TRUE(s.IsTimedOut());
       continue;
     }
-
-    // The remaining follower will crash while trying to commit the config
-    // change evicting the other follower.
-    auto* ts = cluster_->tablet_server_by_uuid(server_uuid);
-    ASSERT_OK(ts->WaitForInjectedCrash(kTimeout));
-    ++affected_servers;
+    ASSERT_TRUE(s.ok()) << s.ToString();
+    ++num_crashed_servers;
   }
-  ASSERT_GE(affected_servers, 1);
-
-  if (is_3_4_3) {
+  if (!is_3_4_3) {
+    ASSERT_EQ(2, num_crashed_servers);
+  } else {
+    ASSERT_GE(num_crashed_servers, 1);
+    ASSERT_LE(num_crashed_servers, 2);
+    vector<decltype(leader_tserver)> servers = { leader_tserver };
+    if (num_crashed_servers == 1) {
+      servers.push_back(evicted_tserver);
+    }
     // In case of the 3-4-3 scheme, make sure the configuration change to
     // add the removed server back as a non-voter hasn't been committed.
     ASSERT_EVENTUALLY([&] {
-      for (auto* srv : { leader_tserver, evicted_tserver }) {
+      for (const auto* server : servers) {
         consensus::ConsensusStatePB cstate;
-        ASSERT_OK(itest::GetConsensusState(srv, tablet_id_, kTimeout, 
EXCLUDE_HEALTH_REPORT,
-                                           &cstate));
+        ASSERT_OK(itest::GetConsensusState(
+            server, tablet_id_, kTimeout, EXCLUDE_HEALTH_REPORT, &cstate));
         ASSERT_TRUE(cstate.has_pending_config());
       }
     });

[3/4] kudu git commit: [raft_consensus-itest] fix flake in Test_KUDU_1735

Reply via email to