David Ribeiro Alves has posted comments on this change.

Change subject: WIP: [catalog manager] fixed deadlock on catalog shutdown
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6134/2//COMMIT_MSG
Commit Message:

PS2, Line 9: Fixed deadlock on system catalog manager shutdown in case of
           : multi-master Kudu cluster.  Prior to the fix, the leader master 
often
           : hung in its 'elected-as-a-leader' callback while trying to write 
into
           : the system table.  It was awaiting for completion of the system 
table
           : operations, but those were retried indefinitely since system 
catalog
           : table's Raft quorum was not available (other masters were 
shutdown).
I think it would be worth figuring out what changed to make this happen now and 
not happen before. If it's failing reliably it shouldn't be too hard using git 
bisect.

In particular I'm concerned whether it's a timing matter (new security stuff 
changed the timings) or if it's a new lock race problem


http://gerrit.cloudera.org:8080/#/c/6134/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS2, Line 838: "Not loading sys catalog metadata";
wrap this properly and add "<<" before the new line's text.


PS2, Line 907: Status CatalogManager::InitCertAuthority() {
oh I see that you're refactoring. likely worth doing in a previous patch (i.e. 
worth doing even if we decide this patchs' strategy can't work)


http://gerrit.cloudera.org:8080/#/c/6134/2/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS2, Line 571: InitCertAuthority
docs


PS2, Line 571:   Status InitCertAuthority();
             :   Status InitTokenSigner();
did you meant this to be here?


PS2, Line 633:   // Check if it's time to generate a new Token Signing Key for 
TokenSigner.
             :   // If so, generate one and persist it into the system table. 
After that,
             :   // push it into the TokenSigner's key queue.
             :   Status TryGenerateNewTskUnlocked();
same, seems unrelated to this change


PS2, Line 759:   // Check if the specified status should be considered as a 
fatal error for
             :   // a leader master. Errors happened during shutdown are not 
considered fatal.
             :   void CheckIfFatalLeaderError(const Status& s) const;
this is very hairy. The way we treat error right now means that this would 
eventually (or even right now) have to do some string parsing.


-- 
To view, visit http://gerrit.cloudera.org:8080/6134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I10ad66fe33d4696adf2a02a09e2790afa8869583
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to