Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 20:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 297:   uint16_t* GetUuidIdxForUuid(const std::string& uuid) {
> I looked through here and I believe this is safe.
Hmm, for consistency with FindUuidIndexByDataDIr, sure.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 34: typedef Callback<void(const std::string&)> ErrorNotificationCb;
> doc callback argument
Done


Line 57:   void RunErrorNotificationCb(const std::string& uuid) const {
> nit: arguments need docs
Done


Line 58:     notify_cb_.Run(uuid);
> It makes me a little nervous not to submit this to a thread pool. We'd have
That's exactly the reason I even considered a thread pool. Since error handling 
at a low level may have locks held, we need to asynchronously run this at a 
higher level when those locks aren't held.

e.g. if we hit a block error while we FlushMRS, we can't shut down the tablet 
because we'll be caught waiting for the rowsets_flush_sem lock.


Line 61:   void RunErrorNotificationCb(const DataDir* dir) const {
> Can we remove this one?
Ah, yeah that is meant to be in a different patch. Thanks for pointing that out.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 141:     fs_manager_->UnsetErrorNotificationCb();
> Nit: move this down to just before tablet_manager_->Shutdown() so it's clea
Done


http://gerrit.cloudera.org:8080/#/c/7029/16/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1067:   }
> Maybe pull out fs_manager_->dd_manager() into a local variable so you don't
Done


Line 1067:   }
> Did you miss this?
Arg I did, thanks for pointing that out.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 117: using rpc::ResultTracker;
> warning: using decl 'ResultTracker' is unused [misc-unused-using-decls]
Done


Line 121: using std::unique_ptr;
> warning: using decl 'unique_ptr' is unused [misc-unused-using-decls]
Done


Line 1072: void TSTabletManager::FailTabletsInDataDir(const string& uuid) {
> Why don't we just have this be a thunk that says "something failed" and the
I think the point of the thunk would be to let handling happen without knowing 
about which directory has failed, right? While shutting down tablets should be 
idempotent anyway, we should be able to limit the number of calls to Shutdown().

Also the FailTabletsInDataDir() is the one who's marking the dir as failed.

Your point about tablet copies is definitely valid. I think right now if we 
were tablet copying while something else triggers a disk failure, the tablet 
copy won't be able to tell that there's a disk failure. That said, coverage for 
that should be in a different patch.


PS20, Line 1074: !uuid_idx
> Didn't think this case was possible.
Hrm, considering how it should be used, it's not. I'll make that a bit more 
strict.


Line 1081:     LOG(INFO) << "Shutting down tablet (not implemented in this 
patch): " << tablet_id;
> If this doesn't do anything, why not just have an empty body of the functio
The way I've split these patches up, I'm sort of building this function up with 
each patch (and only using this function in the final patch, once everything 
else is complete).

The hope was that it would help you see where the changes are being used 
without having to review everything, since they'll only be functional with all 
the patches.


http://gerrit.cloudera.org:8080/#/c/7029/20/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 188:   // Forces shutdown of the tablet replicas in the data dir 
corresponding to 'uuid'.
> I still think it'd be OK to make this private and wire/unwire in Init/Shutd
As per our hipchat discussion, keeping it here


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to