[raft_consensus-itest] fix flake in Test_KUDU_1735
Fixed flake in RaftConsensusParamReplicationModesITest.Test_KUDU_1735
scenario of the raft_consensus-itest suite, for both the 3-4-3 and 3-2-3
replica management schemes.
The flakiness was due to a few reasons, where the most prominent one was
the race in 3-4-3 mode between shutting down replica's consensus
upon DeleteTablet(TABLET_DATA_TOMBSTONED) and re-discovering the peer
by the leader once the catalog manager added the replica back as a
non-voter. Another minor flake was due to not waiting for the manually
elected leader to commit an operation on its own term. And the one
was due to waiting for RECEIVED_OPID instead of COMMITTED_OPID Raft
operation index before setting the --fault_crash_before_append_commit
flag to 1.
I ran the RaftConsensusParamReplicationModesITest.Test_KUDU_1735
scenario using dist_test with --stress_cpu_threads=16 before and after
the fix with the following results:
before the fix (950 out of 1024 failed):
http://dist-test.cloudera.org/job?job_id=aserbin.1542868617.80411
after the fix ( 0 out of 1024 failed):
http://dist-test.cloudera.org/job?job_id=aserbin.1542871109.113847
Change-Id: If44ad0e8363a3aead409484cff68843f1e5d6b6d
Reviewed-on: http://gerrit.cloudera.org:8080/11981
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <[email protected]>
Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/b2097e4b
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/b2097e4b
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/b2097e4b
Branch: refs/heads/master
Commit: b2097e4b9e2eb3cfca6435b1ec05971215494d36
Parents: f65211b
Author: Alexey Serbin <[email protected]>
Authored: Wed Nov 21 21:48:07 2018 -0800
Committer: Alexey Serbin <[email protected]>
Committed: Mon Nov 26 20:29:01 2018 +0000
----------------------------------------------------------------------
.../integration-tests/cluster_itest_util.cc | 16 +--
src/kudu/integration-tests/cluster_itest_util.h | 16 ++-
.../integration-tests/raft_consensus-itest.cc | 109 ++++++++++++-------
3 files changed, 88 insertions(+), 53 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/kudu/blob/b2097e4b/src/kudu/integration-tests/cluster_itest_util.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/cluster_itest_util.cc
b/src/kudu/integration-tests/cluster_itest_util.cc
index 186cf01..9dae097 100644
--- a/src/kudu/integration-tests/cluster_itest_util.cc
+++ b/src/kudu/integration-tests/cluster_itest_util.cc
@@ -212,15 +212,16 @@ Status WaitForOpFromCurrentTerm(TServerDetails* replica,
Status WaitForServersToAgree(const MonoDelta& timeout,
const TabletServerMap& tablet_servers,
const string& tablet_id,
- int64_t minimum_index) {
+ int64_t minimum_index,
+ consensus::OpIdType op_id_type) {
const MonoTime deadline = MonoTime::Now() + timeout;
+ vector<TServerDetails*> servers;
+ AppendValuesFromMap(tablet_servers, &servers);
for (int i = 1; MonoTime::Now() < deadline; i++) {
- vector<TServerDetails*> servers;
- AppendValuesFromMap(tablet_servers, &servers);
vector<OpId> ids;
- Status s = GetLastOpIdForEachReplica(tablet_id, servers,
consensus::RECEIVED_OPID, timeout,
- &ids);
+ Status s = GetLastOpIdForEachReplica(
+ tablet_id, servers, op_id_type, timeout, &ids);
if (s.ok()) {
bool any_behind = false;
bool any_disagree = false;
@@ -248,8 +249,9 @@ Status WaitForServersToAgree(const MonoDelta& timeout,
LOG(INFO) << "Not converged past " << minimum_index << " yet: " << ids;
SleepFor(MonoDelta::FromMilliseconds(min(i * 100, 1000)));
}
- return Status::TimedOut(Substitute("Index $0 not available on all replicas
after $1. ",
- minimum_index,
timeout.ToString()));
+ return Status::TimedOut(
+ Substitute("index $0 not available on all replicas after $1",
+ minimum_index, timeout.ToString()));
}
// Wait until all specified replicas have logged the given index.
http://git-wip-us.apache.org/repos/asf/kudu/blob/b2097e4b/src/kudu/integration-tests/cluster_itest_util.h
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/cluster_itest_util.h
b/src/kudu/integration-tests/cluster_itest_util.h
index 1768880..f9a073e 100644
--- a/src/kudu/integration-tests/cluster_itest_util.h
+++ b/src/kudu/integration-tests/cluster_itest_util.h
@@ -125,15 +125,19 @@ Status WaitForOpFromCurrentTerm(TServerDetails* replica,
const MonoDelta& timeout,
consensus::OpId* opid = nullptr);
-// Wait until all of the servers have converged on the same log index.
-// The converged index must be at least equal to 'minimum_index'.
+// Wait until all of the servers have converged on the same log index. The
+// converged index must be at least equal to 'minimum_index'. The 'op_id_type'
+// parameter is specify which kind of operations to watch for while tracking
+// their indexes.
//
// Requires that all servers are running. Returns Status::TimedOut if the
// indexes do not converge within the given timeout.
-Status WaitForServersToAgree(const MonoDelta& timeout,
- const TabletServerMap& tablet_servers,
- const std::string& tablet_id,
- int64_t minimum_index);
+Status WaitForServersToAgree(
+ const MonoDelta& timeout,
+ const TabletServerMap& tablet_servers,
+ const std::string& tablet_id,
+ int64_t minimum_index,
+ consensus::OpIdType op_id_type = consensus::RECEIVED_OPID);
// Wait until all specified replicas have logged at least the given index.
// Unlike WaitForServersToAgree(), the servers do not actually have to converge
http://git-wip-us.apache.org/repos/asf/kudu/blob/b2097e4b/src/kudu/integration-tests/raft_consensus-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/raft_consensus-itest.cc
b/src/kudu/integration-tests/raft_consensus-itest.cc
index 29adabc..6593f2b 100644
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -2445,8 +2445,8 @@ INSTANTIATE_TEST_CASE_P(,
RaftConsensusParamReplicationModesITest,
::testing::Bool());
// Regression test for KUDU-1735, a crash in the case where a pending
-// config change operation is aborted during tablet deletion when that config
change
-// was in fact already persisted to disk.
+// config change operation is aborted during tablet deletion when that config
+// change was in fact already persisted to disk.
TEST_P(RaftConsensusParamReplicationModesITest, Test_KUDU_1735) {
const MonoDelta kTimeout = MonoDelta::FromSeconds(10);
const bool is_3_4_3 = GetParam();
@@ -2465,6 +2465,7 @@ TEST_P(RaftConsensusParamReplicationModesITest,
Test_KUDU_1735) {
NO_FATALS(BuildAndStart(kTsFlags, kMasterFlags));
+ ASSERT_EQ(3, tablet_servers_.size());
TServerDetails* leader_tserver =
tablet_servers_[cluster_->tablet_server(0)->uuid()];
TServerDetails* evicted_tserver =
@@ -2473,9 +2474,18 @@ TEST_P(RaftConsensusParamReplicationModesITest,
Test_KUDU_1735) {
ASSERT_OK(StartElection(leader_tserver, tablet_id_, kTimeout));
ASSERT_OK(WaitUntilLeader(leader_tserver, tablet_id_, kTimeout));
- // Wait for log index 1 to propagate to all involved tablet servers.
- ASSERT_OK(WaitForServersToAgree(MonoDelta::FromSeconds(10), tablet_servers_,
- tablet_id_, 1));
+ // Wait for at least one operation to be committed in current term by the
+ // leader replica. Attempting to perform a config change (RemoveServer()
+ // in this scenario) prior to committing at least one operation in current
+ // term leads to an error since Kudu's Raft implementation enforces that
+ // invariant.
+ consensus::OpId opid;
+ ASSERT_OK(WaitForOpFromCurrentTerm(
+ leader_tserver, tablet_id_, consensus::COMMITTED_OPID, kTimeout, &opid));
+ // Wait for the committed index to propagate to all involved tablet servers.
+ ASSERT_OK(WaitForServersToAgree(
+ kTimeout, tablet_servers_, tablet_id_, opid.index(),
+ consensus::COMMITTED_OPID));
// Make follower tablet servers crash before writing a commit message.
for (const auto& e : tablet_servers_) {
@@ -2487,9 +2497,9 @@ TEST_P(RaftConsensusParamReplicationModesITest,
Test_KUDU_1735) {
ASSERT_OK(cluster_->SetFlag(ts, "fault_crash_before_append_commit",
"1.0"));
}
- // Run a config change. This will cause the other servers to crash with
pending config
- // change operations due to the above fault injection.
- auto affected_servers = 0;
+ // Run a config change. This will cause the other servers to crash with
+ // pending config change operations due to the above fault injection.
+ auto num_crashed_servers = 0;
ASSERT_OK(RemoveServer(leader_tserver, tablet_id_, evicted_tserver,
kTimeout));
for (const auto& e : tablet_servers_) {
const auto& server_uuid = e.first;
@@ -2497,45 +2507,64 @@ TEST_P(RaftConsensusParamReplicationModesITest,
Test_KUDU_1735) {
// The tablet server with the leader replica will not crash.
continue;
}
- if (server_uuid == evicted_tserver->uuid() && is_3_4_3) {
- // The removed server with follower replica will not receive any
- // eviction-related configuration requests since it's not a part of the
- // resulting Raft configuration. However, the server will receive config
- // change request when a new voter replica is added to replace the
removed
- // one, and the behavior is different between 3-2-3 and 3-4-3 schemes.
- //
- // In case of the 3-2-3 scheme, the gone-and-back server will crash while
- // trying to add the COMMIT entry into the WAL. That change corresponds
- // to the transaction which adds the server as a voting member of the
- // resulting Raft configuration. Two out of three voting replicas are
- // present, so that transaction is committed by the leader.
- //
- // In case of the 3-4-3 scheme, the replica is added back as a non-voter.
- // Only one (out of two required) voting replica is available:
- // the other replica crashes on an attempt to add a log entry about
- // committing the configuration change on this server's eviction.
- // So, since the configuration change on adding the server back as
- // a non-voter replica cannot be committed, there isn't an attempt to add
- // a commit entry into this server's WAL, so it should not crash.
+ // One of the remaining followers will crash while trying to commit the
+ // config change corresponding to the eviction of the other follower.
+ //
+ // As for the other (gone-and-back) follower, the behavior depends on
+ // the replica management scheme, being a bit more subtle in the case of
+ // the 3-4-3 scheme.
+ //
+ // In case of the 3-2-3 scheme, the gone-and-back server will crash while
+ // trying to add the COMMIT entry into the WAL. That change corresponds
+ // to the transaction which adds the server as a voting member of the
+ // resulting Raft configuration.
+ //
+ // In the 3-4-3 case, the catalog manager sends DeleteTablet() and ADD_PEER
+ // Raft configuration change requests upon detecting a change in the
+ // tablet's configuration reported by leader replica (the latter request is
+ // to keep the target replication factor). Those requests are scheduled
+ // to be sent asynchronously around the same time. Due to the concurrency
of
+ // sending and processing those two requests, there are two possible
+ // outcomes:
+ // 1. If the tablet server completes processing the DeleteTablet()
request
+ // before receiving the ADD_PEER config change request, the ADD_PEER
+ // config change to add a new non-voter will remain pending since
+ // only one voter (the leader replica of the tablet) is alive. Since
+ // the configuration change cannot be committed, no commit message is
+ // written to the WAL in this case. So, the tablet server will not hit
+ // the injected crash.
+ // 2. If the DeleteTablet() request is delayed and the leader replica
+ // re-discovers the evicted replica as a new peer with LMP_MISMATCH
+ // status after processing the ADD_PEER configuration change,
+ // the leader replica will send the missing updates to the lingering
+ // one. The lingering replica will try to replay the first missing
+ // update (REMOVE_PEER) and will crash upon adding corresponding
+ // record into its WAL.
+ auto* ts = cluster_->tablet_server_by_uuid(server_uuid);
+ auto s = ts->WaitForInjectedCrash(MonoDelta::FromSeconds(5));
+ if (server_uuid == evicted_tserver->uuid() && is_3_4_3 && !s.ok()) {
+ ASSERT_TRUE(s.IsTimedOut());
continue;
}
-
- // The remaining follower will crash while trying to commit the config
- // change evicting the other follower.
- auto* ts = cluster_->tablet_server_by_uuid(server_uuid);
- ASSERT_OK(ts->WaitForInjectedCrash(kTimeout));
- ++affected_servers;
+ ASSERT_TRUE(s.ok()) << s.ToString();
+ ++num_crashed_servers;
}
- ASSERT_GE(affected_servers, 1);
-
- if (is_3_4_3) {
+ if (!is_3_4_3) {
+ ASSERT_EQ(2, num_crashed_servers);
+ } else {
+ ASSERT_GE(num_crashed_servers, 1);
+ ASSERT_LE(num_crashed_servers, 2);
+ vector<decltype(leader_tserver)> servers = { leader_tserver };
+ if (num_crashed_servers == 1) {
+ servers.push_back(evicted_tserver);
+ }
// In case of the 3-4-3 scheme, make sure the configuration change to
// add the removed server back as a non-voter hasn't been committed.
ASSERT_EVENTUALLY([&] {
- for (auto* srv : { leader_tserver, evicted_tserver }) {
+ for (const auto* server : servers) {
consensus::ConsensusStatePB cstate;
- ASSERT_OK(itest::GetConsensusState(srv, tablet_id_, kTimeout,
EXCLUDE_HEALTH_REPORT,
- &cstate));
+ ASSERT_OK(itest::GetConsensusState(
+ server, tablet_id_, kTimeout, EXCLUDE_HEALTH_REPORT, &cstate));
ASSERT_TRUE(cstate.has_pending_config());
}
});