Alexey Serbin has uploaded a new change for review.
http://gerrit.cloudera.org:8080/7887
Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................
WIP [raft_consensus-itest] fix flake in TestSlowLeader
Under rare conditions, a reader thread of the test workload in the
RaftConsensusITest.TestSlowLeader test might read data from a lagging
follower replica just after the writer threads had switched to a newly
elected leader replica. When that happened, the test failed with
a stack trace like the following:
F0829 01:47:51.052803 1605 test_workload.cc:230] Check failed: \
row_count >= expected_row_count (219550 vs. 249850)
*** Check failure stack trace: ***
@ 0x95e83d google::LogMessage::Fail() \
at thirdparty/src/glog-0.3.5/src/logging.cc:1488
@ 0x9606fd google::LogMessage::SendToLog() \
at thirdparty/src/glog-0.3.5/src/logging.cc:1442
@ 0x95e379 google::LogMessage::Flush() \
at thirdparty/src/glog-0.3.5/src/logging.cc:1312
@ 0x96119f google::LogMessageFatal::~LogMessageFatal() \
at thirdparty/src/glog-0.3.5/src/logging.cc:2024
@ 0x955809 kudu::TestWorkload::ReadThread() \
at tr1/shared_ptr.h:340
@ 0x7fbdbc108a40 (unknown) at ??:0
@ 0x7fbdbd550184 start_thread at ??:0
@ 0x7fbdbbb7637d clone at ??:0
@ (nil) (unknown)
This patch addresses the issue by working around the read-your-writes
(RYW) consistency problem, which shows up here because of the absence
of the leader leases mechanism.
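For illustration only, the read-retry idea behind the workaround can be
sketched roughly as below. This is not the code in test_workload.cc:
the helper name RowCountCatchesUp, its parameters, and the attempt limit
are hypothetical, and the count_rows callback stands in for a scan that
may land on a lagging follower.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical helper (an assumption, not the actual TestWorkload code):
// re-reads the row count until it reaches the expected value or the
// attempts are exhausted, sleeping 'retry_delay' between attempts.
bool RowCountCatchesUp(const std::function<int64_t()>& count_rows,
                       int64_t expected_row_count,
                       std::chrono::milliseconds retry_delay,
                       int max_attempts) {
  assert(max_attempts >= 1);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (count_rows() >= expected_row_count) {
      return true;  // The replica that served the read has caught up.
    }
    if (attempt + 1 < max_attempts) {
      // Give a lagging follower time to catch up, or the next read a
      // chance to reach an up-to-date replica.
      std::this_thread::sleep_for(retry_delay);
    }
  }
  return false;  // Still behind after all attempts.
}
```

With the values used by the test, the retry delay is
kHbIntervalMs * kMaxMissedHbPeriods = 32 * 3 = 96 ms.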
To test the modifications, I ran the test 2K times in several batches;
none of the runs failed (RELEASE build):
http://dist-test.cloudera.org//job?job_id=aserbin.1504051374.17423
To run the test without the workaround, I applied the patch below
and got 1 failure out of 2K runs:
http://dist-test.cloudera.org//job?job_id=aserbin.1504053764.1127
WIP because I'm not sure whether we want this workaround now
or should wait for leader leases to be implemented.
------------------------------------------------------------------------
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -2571,7 +2571,7 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
if (!AllowSlowTests()) return;
static const int kHbIntervalMs = 32;
- static const int kMaxMissedHbPeriods = 3;
+ static const int kMaxMissedHbPeriods = 1;
const vector<string> tserver_flags = {
Substitute("--raft_heartbeat_interval_ms=$0", kHbIntervalMs),
Substitute("--leader_failure_max_missed_heartbeat_periods=$0",
@@ -2586,9 +2586,9 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
TestWorkload workload(cluster_.get());
workload.set_table_name(kTableId);
workload.set_num_read_threads(2);
- workload.set_read_retry_enabled(true);
- workload.set_read_retry_delay(
- MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
+ //workload.set_read_retry_enabled(true);
+ //workload.set_read_retry_delay(
+ // MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
workload.Setup();
workload.Start();
SleepFor(MonoDelta::FromSeconds(60));
------------------------------------------------------------------------
Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
3 files changed, 43 insertions(+), 12 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/7887/1
--
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>