Dinesh Bhat has posted comments on this change. Change subject: [tools] Implement a manual leader_step_down for a tablet ......................................................................
Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/4533/6/src/kudu/tools/kudu-admin-test.cc File src/kudu/tools/kudu-admin-test.cc: PS6, Line 219: --enable_leader_failure_detection=false > Why do you need to override the default timeout value of 30s? timeout_ms is Yeah, I am reverting to default timeout after bunch of tests I executed on dist_test this morning. Initial thought was to allow just enough room for consensus to change, but turned out to be a bad idea. PS6, Line 222: pendValuesFromMap(tablet_servers_, &tservers); > There's no difference in the outcome of the test if we retry 10 times witho Bunch of things I narrowed down here this morning: - The same leader getting elected was just an observation I made on local testbed, when moved to dist_test I was able to see different leader at times(and same leader as well bunch of times). - Also spoke to David on few things, it turns out a leader may never get elected even after terms are advanced multiple times which makes this retry loop even more flaky-prone. This is consistent with my test on dist_test too where I observed couple of times an election resulting into no leader emerging on other side and subsequently failed with an error code from the actual kudu command L208 "Illegal state: not currently a leader". - I also kept just one routine grabbing just the consensus term instead of going to master for leader_uuid, as it turns out master could have a stale info as well. This is inline with what you suggested in one of your earlier comments. Given your above suggestions and my test results, I am inclined to keep the code as following now: if (SubProcess::Call('kudu leader_step_down')) { AssertEventually([&]() { GetINfoFromConsensus(leader_uuid, new_term) ASSERT_GT(new_term, current_term); } } - The reason I am hesitant to ASSERT on leader_step_down command is because as we know leader might have changed in the background not resulting in a new leader, and also I don't happen to be waiting for a leader election after cluster is setup, in which case it's perfectly possible that the command will fail. We can expect the term to have advanced in 30 secs however if the command succeeds, hence safe to assert on that. -- To view, visit http://gerrit.cloudera.org:8080/4533 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
