Alexey Serbin has posted comments on this change. Change subject: [catalog_manager] categorization of rw operation failures ......................................................................
Patch Set 18: (9 comments) http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/integration-tests/catalog_manager_tsk-itest.cc File src/kudu/integration-tests/catalog_manager_tsk-itest.cc: PS18, Line 71: --raft_enable_pre_election=false > ah forgot to suggest that you add this flag. nice. btw if this is a single Yep, having this flag greatly increases chances of leadership change in this scenario -- I found it while trying to achieve better reproduction ratio :) That's a good observation -- yep, I think this flag makes sense only for masters in this scenario regardless on number of involved tservers. Gooo PS18, Line 120: ErrorStatusPB::ERROR_UNAVAILABLE > does this error transpire all the way to the caller? i.e. could you inspect Nope, that's not transpire to the caller -- to the caller it looks like NotAuthorized status, and it's not possible to see the server sent ERROR_UNAVAILABLE (except for parsing the error message, which we are not about to do). That is not about expired/old token. That's about a token signed with a key which hasn't yet propagated to the target tablet server, but already received by the client. As for the expired token errors, there are no calls in this test which would fail do to an expired token -- as you can see, the TokenSigner configured to issue tokens valid for the duration of the test. Also, the token verification happens during connection establishment, and that's the only place it's checked. PS18, Line 121: gets authn token signed by the : // key which hasn't yet reached the server. > do you mean: "gets authn token signed by the master which hasn't yet reache Here it's about the key used for token verification. So, probably it would be easier to understand if I replace 'server' with 'tserver'. Line 128: // NOTE: This should be removed once the client is updated to automatically > mark with TODO(PKI) Done Line 159: // Check that master servers do not crash on change of leadership while > yeah are you sure that the master leadership is changing? Yes, I'm: I spent some time yesterday verifying that and discovered that 'raft_enable_pre_election' option to make it happen more often. It's possible to see that in the log while running the test: search for messages like 'failed to refresh TSK'. Usually, within 60 second run it happens 20-30 times when running the debug build. Line 159: // Check that master servers do not crash on change of leadership while > What's forcing leadership change in this test? The leadership changes are provoked by the injected latency just after generating TSK key but prior to writing it into the system table: setting --leader_failure_max_missed_heartbeat_periods flag to just one heartbeat period and unsetting --raft_enable_pre_election gives high chances of re-election to happen while current leader has blocked its leadership-related activity. It's possible to see that in logs while running the test. We want to observe messages like 'failed to refresh TSK', and that happens pretty often. I added the comment for the test. Line 176: waiter.join(); > Could you change this so that SmokeTestCluster is run in the thread in a lo Why is it any better? Instead, if some error happened in the thread, I'm not sure it's correctly handled by the gtest -- as I remember, there was an issue to have ASSERT_xxx() in non-main thread so it would be possible to call just CHECK_xx() instead. http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 473: // If there is an error (e.g., we are not the leader) abort this task > With the additions below, this path is no longer aborting the task. Should Since from this point it's not clear why the error happened (it might be something else besides leadership change), I think it's worth continuing and try to do TSK-related work. If that was leadership-related problem, it will be detected by the code below pretty fast. That's why I think it's better not to put 'continue' here. PS18, Line 499: the TokenSigner has left with no key > this wording isn't clear. It may not be needed at all. ok, I'll remove it if you think it's not needed. The idea was to give some reasoning why the process was forced to crash. -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 18 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
