Dinesh Bhat has posted comments on this change.
Change subject: [tools] Implement a manual leader_step_down for a tablet
Patch Set 3:
TFTR Adar, pls see responses inline, also pls see why my jenkins are still
failing. I see that they are still using ...llvm-3.8.0.src/ as per the logs.
PS1, Line 204: string c
> The CHECK_* macros come from glog, and they all do the same thing regardles
Thank you for that awesome clarity on when to use them. This has gone in my
favorite notes :)
Line 218: tablet_id_
> Unfortunately a test environment with leader failure detection isn't determ
sure, removed the comparison check, relying on master to report leader info
now. I have kept the getconsensus intact to grab the new term info.
PS1, Line 230: "leader_failure_max_missed_heartbeat_periods",
> Couple things:
Sorry for not being clear before. Lemme try to explain although I haven't root
caused the issue myself yet. When I used AssertEventually the GetConsensus
comes out around 1.5 secs or so with a new_term. But what is not clear to me
yet is, if I sleep for ~1.5 secs to allow the consensus to settle down, I see
eventually a new_leader emerging which won't happen if I keep using
GetConsensus via AssertEventually(despite 20 retries). Logs are confimring that
GetConsensus indeed is detecting the new leader correctly, it's just that
new_leader happens to be the current_leader everytime.
I want to see why I see same leader gets elected again and again unless I allow
some more time window before calling GetConsensus in which case I see
new_leader coming up after few retries. With the fix of '++attempts' however, I
eventually see another leader coming up so thats kinda hiding the problem. I
think that ++attempts fix should have been there to keep the sleep time
PS1, Line 241: ASSERT_OK(GetTermFromConsensus(tservers, tablet_id_,
> Again, the leader can change whenever, so this doesn't seem safe.
> The actual check (not in the CHECK sense of the word, but the comparison an
yeah, missed out removing here i think. done.
PS2, Line 251: AGS_num_
> Still got some leftover CHECK_* statements that should be converted to thei
Done, thanks for explaining the distinctions earlier.
PS1, Line 274:
> You missed these comments down here.
Ugh... sorry, for some reason I totally missed out scrolling down thinking that
only change I had introduced was in the routine above :)
PS1, Line 278:
> I admit, this one was my bad, but can you consolidate all of these descript
PS1, Line 280:
> Likewise for this, it's copied extensively.
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Owner: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>