[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin (Code Review) Wed, 08 Mar 2017 12:12:20 -0800

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................

Patch Set 18:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS18, Line 71: --raft_enable_pre_election=false
> ah forgot to suggest that you add this flag. nice. btw if this is a single
Yep, having this flag greatly increases chances of leadership change in this
scenario -- I found it while trying to achieve better reproduction ratio :)

That's a good observation -- yep, I think this flag makes sense only for
masters in this scenario regardless on number of involved tservers. Gooo

PS18, Line 120: ErrorStatusPB::ERROR_UNAVAILABLE
> does this error transpire all the way to the caller? i.e. could you inspect
Nope, that's not transpire to the caller -- to the caller it looks like
NotAuthorized status, and it's not possible to see the server sent
ERROR_UNAVAILABLE (except for parsing the error message, which we are not about
to do).

That is not about expired/old token. That's about a token signed with a key
which hasn't yet propagated to the target tablet server, but already received
by the client.

As for the expired token errors, there are no calls in this test which would
fail do to an expired token -- as you can see, the TokenSigner configured to
issue tokens valid for the duration of the test. Also, the token verification
happens during connection establishment, and that's the only place it's checked.

PS18, Line 121: gets authn token signed by the
: // key which hasn't yet reached the server.
> do you mean: "gets authn token signed by the master which hasn't yet reache
Here it's about the key used for token verification. So, probably it would be
easier to understand if I replace 'server' with 'tserver'.

Line 128: // NOTE: This should be removed once the client is updated to
automatically
> mark with TODO(PKI)
Done

Line 159: // Check that master servers do not crash on change of leadership
while
> yeah are you sure that the master leadership is changing?
Yes, I'm: I spent some time yesterday verifying that and discovered that
'raft_enable_pre_election' option to make it happen more often.

It's possible to see that in the log while running the test: search for
messages like 'failed to refresh TSK'. Usually, within 60 second run it
happens 20-30 times when running the debug build.

Line 159: // Check that master servers do not crash on change of leadership
while
> What's forcing leadership change in this test?
The leadership changes are provoked by the injected latency just after
generating TSK key but prior to writing it into the system table: setting
--leader_failure_max_missed_heartbeat_periods flag to just one heartbeat period
and unsetting --raft_enable_pre_election gives high chances of re-election to
happen while current leader has blocked its leadership-related activity. It's
possible to see that in logs while running the test. We want to observe
messages like 'failed to refresh TSK', and that happens pretty often.

I added the comment for the test.

Line 176: waiter.join();
> Could you change this so that SmokeTestCluster is run in the thread in a lo
Why is it any better? Instead, if some error happened in the thread, I'm not
sure it's correctly handled by the gtest -- as I remember, there was an issue
to have ASSERT_xxx() in non-main thread so it would be possible to call just
CHECK_xx() instead.

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 473: // If there is an error (e.g., we are not the leader)
abort this task
> With the additions below, this path is no longer aborting the task. Should
Since from this point it's not clear why the error happened (it might be
something else besides leadership change), I think it's worth continuing and
try to do TSK-related work. If that was leadership-related problem, it will be
detected by the code below pretty fast.

That's why I think it's better not to put 'continue' here.

PS18, Line 499: the TokenSigner has left with no key
> this wording isn't clear. It may not be needed at all.
ok, I'll remove it if you think it's not needed. The idea was to give some
reasoning why the process was forced to crash.

--
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Reply via email to