Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet

Patch Set 7:


File src/kudu/tools/kudu-admin-test.cc:

PS6, Line 219: --enable_leader_failure_detection=false
> Why do you need to override the default timeout value of 30s? timeout_ms is
Yeah, I am reverting to default timeout after bunch of tests I executed on 
dist_test this morning. Initial thought was to allow just enough room for 
consensus to change, but turned out to be a bad idea.

PS6, Line 222: pendValuesFromMap(tablet_servers_, &tservers);
> There's no difference in the outcome of the test if we retry 10 times witho
Bunch of things I narrowed down here this morning:
- The same leader getting elected was just an observation I made on local 
testbed, when moved to dist_test I was able to see different leader at 
times(and same leader as well bunch of times).

- Also spoke to David on few things, it turns out a leader may never get 
elected even after terms are advanced multiple times which makes this retry 
loop even more flaky-prone. This is consistent with my test on dist_test too 
where I observed couple of times an election resulting into no leader emerging 
on other side and subsequently failed with an error code from the actual kudu 
command L208 "Illegal state: not currently a leader". 

- I also kept just one routine grabbing just the consensus term instead of 
going to master for leader_uuid, as it turns out master could have a stale info 
as well. This is inline with what you suggested in one of your earlier comments.

Given your above suggestions and my test results, I am inclined to keep the 
code as following now:

if (SubProcess::Call('kudu leader_step_down')) {
  AssertEventually([&]() {
        GetINfoFromConsensus(leader_uuid, new_term)
        ASSERT_GT(new_term, current_term);

- The reason I am hesitant to ASSERT on leader_step_down command is because as 
we know leader might have changed in the background not resulting in a new 
leader, and also I don't happen to be waiting for a leader election after 
cluster is setup, in which case it's perfectly possible that the command will 
fail. We can expect the term to have advanced in 30 secs however if the command 
succeeds, hence safe to assert on that.

To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <d...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

Reply via email to