[kudu-CR] WIP [tests] scenario to repro off-by-one error in TestWorkload
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9277 )

Change subject: WIP [tests] scenario to repro off-by-one error in TestWorkload
..

Patch Set 1:

> Patch Set 1:
>
> I briefly tried this with my WIP implementation of leader leases and the
> problem seemed to have gone away. While not definitive, this is encouraging
> evidence that the problem indeed stems from the lack of leader leases.
> Conceptually this seems plausible too. Scenario:
> - Two replicas (a,b) dispute leadership of tablet A, i.e. both think they're
>   leaders, but "b" is the actual most recent leader.
> - Client writes the last row to replica "b".
> - Client then performs a scan at snapshot for all rows on replica "a",
>   sending the timestamp of the write.
> - Replica "a", thinking it's leader, simply considers the current timestamp
>   "safe" and executes a (non-repeatable) scan that doesn't include the last row
>   written to b.
>
> Possible follow-up steps:
> - Retry the scan. If the problem goes away, then it's more likely that the
>   problem stems from the lack of leader leases.
> - Even better, hack it so that we retry the scan _at the same snapshot
>   timestamp_.
> - Add logging events (test only) that register the timestamps/safe
>   time/chosen replica on the server side (hopefully showing the smoking gun).
> - Implement leader leases and put this test on top :)

Thank you, David, for looking at this! I'm glad to know your in-progress
leader leases patch fixed the problem.

I think I can start hacking on this to add a retry of the scan at the same
snapshot timestamp, as recommended.
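To make the scenario and the "retry at the same snapshot timestamp" step concrete, here is a rough toy model. It is only a sketch under loud assumptions: ToyReplica, ToyResult, WillServe, CountRowsAt and RunToyScenario are hypothetical names invented for illustration. None of them are Kudu classes or APIs, and the "safe time" rule is a simplification of what a tablet server actually does.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Toy model only: these are NOT Kudu classes, just enough state to mimic
// the scenario described in the quoted comment.
struct ToyReplica {
  std::set<uint64_t> applied_ts;  // timestamps of rows applied on this replica
  bool believes_leader = false;   // may be stale when there are no leader leases

  // Highest timestamp this replica has actually applied (0 if none).
  uint64_t SafeTime() const {
    return applied_ts.empty() ? 0 : *applied_ts.rbegin();
  }

  // Assumption: a replica that believes it is the leader treats any timestamp
  // as safe; a follower only serves a snapshot it has caught up to.
  bool WillServe(uint64_t snapshot_ts) const {
    return believes_leader || SafeTime() >= snapshot_ts;
  }

  // Number of rows visible at the given snapshot timestamp.
  std::size_t CountRowsAt(uint64_t snapshot_ts) const {
    std::size_t n = 0;
    for (uint64_t ts : applied_ts) {
      if (ts <= snapshot_ts) ++n;
    }
    return n;
  }
};

struct ToyResult {
  std::size_t first_scan;  // row count from the first scan on "a"
  std::size_t retry_scan;  // row count from the retry at the same snapshot
};

// Replays the scenario: "b" holds the last write, "a" is a stale leader.
ToyResult RunToyScenario() {
  ToyReplica a, b;
  a.believes_leader = true;
  b.believes_leader = true;
  for (uint64_t ts = 1; ts <= 2; ++ts) {  // rows replicated to both replicas
    a.applied_ts.insert(ts);
    b.applied_ts.insert(ts);
  }
  const uint64_t last_write_ts = 3;
  b.applied_ts.insert(last_write_ts);  // the last row lands only on "b"

  ToyResult r{};
  // First scan on "a" at the write's timestamp: served, but one row short.
  r.first_scan = a.WillServe(last_write_ts) ? a.CountRowsAt(last_write_ts) : 0;

  // Later "a" steps down and catches up with the last write.
  a.believes_leader = false;
  a.applied_ts.insert(last_write_ts);

  // Retry at the SAME snapshot timestamp.
  r.retry_scan = a.WillServe(last_write_ts) ? a.CountRowsAt(last_write_ts) : 0;
  return r;
}
```

In this model the first scan sees one row fewer than the retry although both use the same snapshot timestamp. If the real test shows the same pattern, that would point at a non-repeatable read served by a stale leader rather than at a lost write.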
--
To view, visit http://gerrit.cloudera.org:8080/9277
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60666b8b05dce8dd13fcdee6408c0930915ba0c1
Gerrit-Change-Number: 9277
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin
Gerrit-Reviewer: Alexey Serbin
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon
Gerrit-Comment-Date: Wed, 21 Feb 2018 18:52:24 +
Gerrit-HasComments: No
[kudu-CR] WIP [tests] scenario to repro off-by-one error in TestWorkload
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/9277 )

Change subject: WIP [tests] scenario to repro off-by-one error in TestWorkload
..

Patch Set 1:

I briefly tried this with my WIP implementation of leader leases and the
problem seemed to have gone away. While not definitive, this is encouraging
evidence that the problem indeed stems from the lack of leader leases.
Conceptually this seems plausible too. Scenario:
- Two replicas (a,b) dispute leadership of tablet A, i.e. both think they're
  leaders, but "b" is the actual most recent leader.
- Client writes the last row to replica "b".
- Client then performs a scan at snapshot for all rows on replica "a",
  sending the timestamp of the write.
- Replica "a", thinking it's leader, simply considers the current timestamp
  "safe" and executes a (non-repeatable) scan that doesn't include the last row
  written to b.

Possible follow-up steps:
- Retry the scan. If the problem goes away, then it's more likely that the
  problem stems from the lack of leader leases.
- Even better, hack it so that we retry the scan _at the same snapshot
  timestamp_.
- Add logging events (test only) that register the timestamps/safe
  time/chosen replica on the server side (hopefully showing the smoking gun).
- Implement leader leases and put this test on top :)

--
To view, visit http://gerrit.cloudera.org:8080/9277
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60666b8b05dce8dd13fcdee6408c0930915ba0c1
Gerrit-Change-Number: 9277
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon
Gerrit-Comment-Date: Wed, 21 Feb 2018 00:54:04 +
Gerrit-HasComments: No
[kudu-CR] WIP [tests] scenario to repro off-by-one error in TestWorkload
Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9277 )

Change subject: WIP [tests] scenario to repro off-by-one error in TestWorkload
..

WIP [tests] scenario to repro off-by-one error in TestWorkload

This patch is basically the patch from https://gerrit.cloudera.org/#/c/9255/
with additional updates.

To repro, compile in RELEASE configuration and run locally or using
dist-test:

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/raft_consensus_nonvoter-itest \
    --gtest_filter='*RemoveReplaceInCycle*'

The failure rate at dist-test is about 1/16. The relevant failures look
like the following (two cases at least):

  src/kudu/integration-tests/cluster_verifier.cc:131: Failure
  Failed
  Bad status: Corruption: row count 31950 is not exactly expected value 31949
  src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc:2045: Failure
  Expected: v.CheckRowCount(workload.table_name(), ClusterVerifier::EXACTLY, workload.rows_inserted()) doesn't generate new fatal failures in the current thread.
    Actual: it does.

  F0210 04:28:10.156322 5051 test_workload.cc:236] Check failed: row_count >= expected_row_count (31049 vs. 31050)
  *** Check failure stack trace: ***
      @ 0x9426bd google::LogMessage::Fail() at thirdparty/src/glog-0.3.5/src/logging.cc:1488
      @ 0x94457d google::LogMessage::SendToLog() at thirdparty/src/glog-0.3.5/src/logging.cc:1442
      @ 0x9421f9 google::LogMessage::Flush() at thirdparty/src/glog-0.3.5/src/logging.cc:1312
      @ 0x94501f google::LogMessageFatal::~LogMessageFatal() at thirdparty/src/glog-0.3.5/src/logging.cc:2024
      @ 0x930339 kudu::TestWorkload::ReadThread() at /opt/rh/devtoolset-3/root/usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/tr1/shared_ptr.h:340
      @ 0x7f22ba943a60 (unknown) at ??:0
      @ 0x7f22bbd8b184 start_thread at ??:0
      @ 0x7f22ba3b0ffd clone at ??:0
      @ (nil) (unknown)

An example run at dist-test:
  http://dist-test.cloudera.org//job?job_id=aserbin.1518236014.105889

Change-Id: I60666b8b05dce8dd13fcdee6408c0930915ba0c1
---
M src/kudu/integration-tests/cluster_verifier.h
M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tserver/ts_tablet_manager.cc
4 files changed, 177 insertions(+), 11 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/9277/1

--
To view, visit http://gerrit.cloudera.org:8080/9277
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I60666b8b05dce8dd13fcdee6408c0930915ba0c1
Gerrit-Change-Number: 9277
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin