Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling ......................................................................
Patch Set 20: (14 comments) http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: Line 297: uint16_t* GetUuidIdxForUuid(const std::string& uuid) { > I looked through here and I believe this is safe. Hmm, for consistency with FindUuidIndexByDataDIr, sure. http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 34: typedef Callback<void(const std::string&)> ErrorNotificationCb; > doc callback argument Done Line 57: void RunErrorNotificationCb(const std::string& uuid) const { > nit: arguments need docs Done Line 58: notify_cb_.Run(uuid); > It makes me a little nervous not to submit this to a thread pool. We'd have That's exactly the reason I even considered a thread pool. Since error handling at a low level may have locks held, we need to asynchronously run this at a higher level when those locks aren't held. e.g. if we hit a block error while we FlushMRS, we can't shut down the tablet because we'll be caught waiting for the rowsets_flush_sem lock. Line 61: void RunErrorNotificationCb(const DataDir* dir) const { > Can we remove this one? Ah, yeah that is meant to be in a different patch. Thanks for pointing that out. http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/tablet_server.cc File src/kudu/tserver/tablet_server.cc: Line 141: fs_manager_->UnsetErrorNotificationCb(); > Nit: move this down to just before tablet_manager_->Shutdown() so it's clea Done http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 1067: } > Maybe pull out fs_manager_->dd_manager() into a local variable so you don't Done Line 1067: } > Did you miss this? Arg I did, thanks for pointing that out. http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 117: using rpc::ResultTracker; > warning: using decl 'ResultTracker' is unused [misc-unused-using-decls] Done Line 121: using std::unique_ptr; > warning: using decl 'unique_ptr' is unused [misc-unused-using-decls] Done Line 1072: void TSTabletManager::FailTabletsInDataDir(const string& uuid) { > Why don't we just have this be a thunk that says "something failed" and the I think the point of the thunk would be to let handling happen without knowing about which directory has failed, right? While shutting down tablets should be idempotent anyway, we should be able to limit the number of calls to Shutdown(). Also the FailTabletsInDataDir() is the one who's marking the dir as failed. Your point about tablet copies is definitely valid. I think right now if we were tablet copying while something else triggers a disk failure, the tablet copy won't be able to tell that there's a disk failure. That said, coverage for that should be in a different patch. PS20, Line 1074: !uuid_idx > Didn't think this case was possible. Hrm, considering how it should be used, it's not. I'll make that a bit more strict. Line 1081: LOG(INFO) << "Shutting down tablet (not implemented in this patch): " << tablet_id; > If this doesn't do anything, why not just have an empty body of the functio The way I've split these patches up, I'm sort of building this function up with each patch (and only using this function in the final patch, once everything else is complete). The hope was that it would help you see where the changes are being used without having to review everything, since they'll only be functional with all the patches. http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 188: // Forces shutdown of the tablet replicas in the data dir corresponding to 'uuid'. > I still think it'd be OK to make this private and wire/unwire in Init/Shutd As per our hipchat discussion, keeping it here -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 20 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
