Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8395 )
Change subject: error_manager: synchronize/serialize handling ...................................................................... Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/8395/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8395/3//COMMIT_MSG@13 PS3, Line 13: : The state of a tablet server post-disk-failure depends significantly on : the completion of disk-failure-handling callbacks. I.e. error handling : _must_ finish before anything is propagated back to the offending caller. : This is trickier when multiple calls are in flight that may trigger : error handling for a single tablet. Andrew and I discussed this at length; let me try to summarize what we talked about. The error notification callback alluded to isn't in this patch, or any other published patch yet. It's in another patch that Andrew is working on. Roughly speaking, here's what it does: 1. Mark the data dir as failed. 2. Get all the tablets using that data dir. 3. For each tablet: 3a. Stop the tablet. 3b. Cancel its outstanding maintenance ops 3c. Asynchronously stop its replica. Also importantly, some disk failure paths invoke this callback, while others (e.g. all data directories have failed) do not; they just return the error up the stack. The invariant Andrew is alluding to is a new one: an MM op that fails due to a disk failure will now CHECK that its tablet has been stopped. So here's the issue: 1. There are two outstanding MM ops for a given tablet with just one data dir. 2. T1 encounters a disk failure and begins running the error notification callback. It gets as far as marking the data dir as failed. 3. T2 tries to get a new block and gets an "all data directories have failed" error. 4. T2 finishes its op with this error before T1 makes more progress. 5. T2 fails the CHECK because T1 has yet to stop the tablet. I found the indirect "wait for an error notification callback" approach to be confusing, and suggested an alternative: errors like "all data directories have failed" should also trigger an error notification callback, a different kind that provides the associated tablet ID instead of the data directory. The error manager should synchronize this callback type the same way as the existing type: by acquiring a mutex while running the callback. What I like about this alternative is that, from the perspective of the block managers it's business as usual: there was an IOError and I need to run a notification callback. There's no need to choose between running a callback and waiting. http://gerrit.cloudera.org:8080/#/c/8395/3/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: http://gerrit.cloudera.org:8080/#/c/8395/3/src/kudu/fs/error_manager.h@84 PS3, Line 84: class FsErrorManager { Could you start a error_manager.cc file and move most (if not all?) of the implementations there? AFAICT none of the methods here are performance critical (and thus deserving of inlining), and we could forward declare more (and include less) in the header if the implementations were hidden. http://gerrit.cloudera.org:8080/#/c/8395/3/src/kudu/fs/error_manager.h@136 PS3, Line 136: // Protects access to the threadpool. Not necessary; ThreadPool is already fully thread safe. -- To view, visit http://gerrit.cloudera.org:8080/8395 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e Gerrit-Change-Number: 8395 Gerrit-PatchSet: 3 Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Thu, 26 Oct 2017 23:54:55 +0000 Gerrit-HasComments: Yes