[kudu-CR] [tools] Implement a manual leader step down for a tablet

Adar Dembo (Code Review) Wed, 28 Sep 2016 14:38:16 -0700

Adar Dembo has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................



Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS1, Line 204: }
> Sure, could you tell me what's the relationship between this being a main t
The CHECK_* macros come from glog, and they all do the same thing regardless of 
the context in which they're called:
1. Evaluate an expression.
2. If evaluation fails, log something.
3. Dump a stack trace.
4. Call abort() to exit the process.

The DCHECK_* macros (also from glog) are debug-only variants that are otherwise 
equivalent, but compiled out in non-debug builds.

The ASSERT_* macros come from gtest, so they're specific to how gtest works. 
They too evaluate an expression, but on failure, they do some gtest stuff to 
capture the failure, end the run of the current test, and move to the next 
test. They have some foibles, and one of them is that ASSERT failures in the 
thread that isn't the main process thread don't seem to get noticed  by gtest. 
Maybe there's something we need to do for that to happen; if there is, I don't 
know what it is.

Oh, and EXPECT_* is like ASSERT_* except that the test keeps running. So 
ASSERT_* will only ever reveal the "first" failure, while EXPECT_* will reveal 
all of them.

So to summarize:
- Use CHECK_*/DCHECK_* macros in production code. Use DCHECK_* if you expect to 
get sufficient test coverage of the callsite, CHECK_* otherwise.
- Use ASSERT_*/EXPECT_* macros in test-only code, unless that code is expected 
to run on a separate test thread, in which case use CHECK_* too. Default to 
EXPECT_*, unless the operation in question must succeed for the rest of the 
test to function properly, in which case use ASSERT_*.

Hope this helps.


Line 218:     ASSERT_OK(Subprocess::Call({
> In reality yes, but given that we are operating in a deterministic test env
Unfortunately a test environment with leader failure detection isn't 
deterministic. The test can run on a machine with arbitrary load (perhaps from 
other tests), in which case one of the tserver processes can pause for 
arbitrary periods of time, causing a new leader to be elected. This is one of 
the leading causes of (low-grade) test flakiness for us.


PS1, Line 230:     CHECK_OK(GetInfoFromConsensus(tservers, tablet_id_,
             :                                   &new_leader, &new_term));
> I tried this with an unsuccessful result, later realized AssertEventually()
Couple things:

1. I don't understand why AssertEventually() doesn't work here. AFAICT the goal 
of this snippet is "tell the leader to step down, wait for an election to 
happen, then assert that the new term is higher than the one before". If you 
wrap L229-L233 in AssertEventually(), doesn't that mean you don't need the 3 
second sleep (which could be flaky to begin with, since a cluster running with 
ASAN/TSAN could take a lot longer to run an election)? I don't see what the 
issue is; can you help me understand? Look at some of the existing usages of 
AssertEventually() if it's not clear.
2. I'm not seeing the bug in the code you pasted. What is it?


Line 264:   s = GetInfoFromConsensus(tservers, tablet_id_,
> Done
The actual check (not in the CHECK sense of the word, but the comparison and 
assertion) is still here, though.


http://gerrit.cloudera.org:8080/#/c/4533/2/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS2, Line 251: CHECK_OK
Still got some leftover CHECK_* statements that should be converted to their 
ASSERT_* equivalents.


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 194: 
> Done, although not sure if we need a toggle given that the routine anyways 
On second thought, the ResolveAddresses output isn't even used, is it? So let's 
drop it altogether. Looks like address resolution happens within BuildProxy() 
so that extra call was redundant from day one (in ConfigChange() as well).


PS1, Line 274: 
             : 
> Nit: reformat as "Force the tablet's leader replica (if present) to step do
You missed these comments down here.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Reply via email to