Andrew Wong has posted comments on this change.

Change subject: disk failure: coordinate error handling
......................................................................


Patch Set 3:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS8, Line 60:     LOG(WARNING) << strings::Substitute("Dir $0 failed, marking 
for failure", dir->dir());
            :     uint16_t uuid_idx;
            :     if (dd_manager_->FindUuidIndexByDataDir(dir, &uuid_idx)) {
            :       // If the directory is registered as failed, ignore it.
            :       if (dd_manager_->IsDataDirFailed(
> hm, I dont think "fallthrough intended" is really necessary here. Usually s
Done


PS8, Line 74: 
            :       return;
> not sure what this means
Done


PS8, Line 80: 
> maybe use a typedef for this?
Done


Line 81:   // Must outlive the eio error manager.
> std::move(cb)?
Done


Line 125
> nit: I think it's better to keep the specifics of what the callback _does_ 
Done. I agree the error_manager is meant to handle more than just disk failures.

I was thinking of having something like (failure type => get_observers), 
(failure type => notify_cb), (failure type => do other stuff) might be 
warranted later on. Or just leave it completely open-ended (failure type => 
handle error). The question is: design that now or when we have more handleable 
errors?


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

PS8, Line 121: *
> why is this taking a pointer to a callback instead of just taking a callbac
Done


Line 214:   void SetTabletFailedCallback(Callback<void(std::set<std::string>)>* 
cb);
> same
Done


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 684:     LOG(WARNING) << s.ToString();
> not quite following the purpose of this parameter. In the case that delete_
Done. Split out the above code to transition tablet state. Only that and 
replica->Shutdown() are needed from what I can tell.


Line 804:   MonoTime start(MonoTime::Now());
> is there a DCHECK we could add here that the replica is already marked fail
Moving this logic out of this patch, but done.


http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 131: 
> can you update the doc as to why you would want to DeleteTablet without del
Took your other suggestion of bring out the needed logic out into its own 
function.


PS8, Line 190:   // TODO(awong): rather than deleting the tablet, don't do 
anything to the
             :   // tablet metadata and just store disk failure state instead.
             :   void FailTabletReplicas(const std::set<std::string>& 
tablet_ids);
             : 
> the docs don't seem to match the name here. Is this shutting it down or try
Done


Line 294: 
> I think if you made the other classes take the callback by value instead of
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to