Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................


Patch Set 18:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS18, Line 71: --raft_enable_pre_election=false
> ah forgot to suggest that you add this flag. nice. btw if this is a single 
Yep, having this flag greatly increases chances of leadership change in this 
scenario -- I found it while trying to achieve better reproduction ratio :)

That's a good observation -- yep, I think this flag makes sense only for 
masters in this scenario regardless on number of involved tservers.  Gooo


PS18, Line 120: ErrorStatusPB::ERROR_UNAVAILABLE
> does this error transpire all the way to the caller? i.e. could you inspect
Nope, that's not transpire to the caller -- to the caller it looks like 
NotAuthorized status, and it's not possible to see the server sent 
ERROR_UNAVAILABLE (except for parsing the error message, which we are not about 
to do).

That is not about expired/old token.  That's about a token signed with a key 
which hasn't yet propagated to the target tablet server, but already received 
by the client.

As for the expired token errors, there are no calls in this test which would 
fail do to an expired token -- as you can see, the TokenSigner configured to 
issue tokens valid for the duration of the test.  Also, the token verification 
happens during connection establishment, and that's the only place it's checked.


PS18, Line 121: gets authn token signed by the
              :     // key which hasn't yet reached the server.
> do you mean: "gets authn token signed by the master which hasn't yet reache
Here it's about the key used for token verification.  So, probably it would be 
easier to understand if I replace 'server' with 'tserver'.


Line 128:     // NOTE: This should be removed once the client is updated to 
automatically
> mark with TODO(PKI)
Done


Line 159: // Check that master servers do not crash on change of leadership 
while
> yeah are you sure that the master leadership is changing?
Yes, I'm: I spent some time yesterday verifying that and discovered that 
'raft_enable_pre_election' option to make it happen more often.

It's possible to see that in the log while running the test: search for 
messages like 'failed to refresh TSK'.  Usually, within 60 second run it 
happens 20-30 times when running the debug build.


Line 159: // Check that master servers do not crash on change of leadership 
while
> What's forcing leadership change in this test?
The leadership changes are provoked by the injected latency just after 
generating TSK key but prior to writing it into the system table: setting 
--leader_failure_max_missed_heartbeat_periods flag to just one heartbeat period 
and unsetting --raft_enable_pre_election gives high chances of re-election to 
happen while current leader has blocked its leadership-related activity.  It's 
possible to see that in logs while running the test.  We want to observe 
messages like 'failed to refresh TSK', and that happens pretty often.

I added the comment for the test.


Line 176:   waiter.join();
> Could you change this so that SmokeTestCluster is run in the thread in a lo
Why is it any better?  Instead, if some error happened in the thread, I'm not 
sure it's correctly handled by the gtest -- as I remember, there was an issue 
to have ASSERT_xxx() in non-main thread so it would be possible to call just 
CHECK_xx() instead.


http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 473:             // If there is an error (e.g., we are not the leader) 
abort this task
> With the additions below, this path is no longer aborting the task.  Should
Since from this point it's not clear why the error happened (it might be 
something else besides leadership change), I think it's worth continuing and 
try to do TSK-related work.  If that was leadership-related problem, it will be 
detected by  the code below pretty fast.

That's why I think it's better not to put 'continue' here.


PS18, Line 499: the TokenSigner has left with no key
> this wording isn't clear.  It may not be needed at all.
ok, I'll remove it if you think it's not needed.  The idea was to give some 
reasoning why the process was forced to crash.


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to