David Ribeiro Alves has posted comments on this change. Change subject: WIP: [catalog manager] fixed deadlock on catalog shutdown ......................................................................
Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/6134/2//COMMIT_MSG Commit Message: PS2, Line 9: Fixed deadlock on system catalog manager shutdown in case of : multi-master Kudu cluster. Prior to the fix, the leader master often : hung in its 'elected-as-a-leader' callback while trying to write into : the system table. It was awaiting for completion of the system table : operations, but those were retried indefinitely since system catalog : table's Raft quorum was not available (other masters were shutdown). I think it would be worth figuring out what changed to make this happen now and not happen before. If it's failing reliably it shouldn't be too hard using git bisect. In particular I'm concerned whether it's a timing matter (new security stuff changed the timings) or if it's a new lock race problem http://gerrit.cloudera.org:8080/#/c/6134/2/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS2, Line 838: "Not loading sys catalog metadata"; wrap this properly and add "<<" before the new line's text. PS2, Line 907: Status CatalogManager::InitCertAuthority() { oh I see that you're refactoring. likely worth doing in a previous patch (i.e. worth doing even if we decide this patchs' strategy can't work) http://gerrit.cloudera.org:8080/#/c/6134/2/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: PS2, Line 571: InitCertAuthority docs PS2, Line 571: Status InitCertAuthority(); : Status InitTokenSigner(); did you meant this to be here? PS2, Line 633: // Check if it's time to generate a new Token Signing Key for TokenSigner. : // If so, generate one and persist it into the system table. After that, : // push it into the TokenSigner's key queue. : Status TryGenerateNewTskUnlocked(); same, seems unrelated to this change PS2, Line 759: // Check if the specified status should be considered as a fatal error for : // a leader master. Errors happened during shutdown are not considered fatal. : void CheckIfFatalLeaderError(const Status& s) const; this is very hairy. The way we treat error right now means that this would eventually (or even right now) have to do some string parsing. -- To view, visit http://gerrit.cloudera.org:8080/6134 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I10ad66fe33d4696adf2a02a09e2790afa8869583 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
