Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling ......................................................................
Patch Set 3: (12 comments) http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: PS8, Line 60: LOG(WARNING) << strings::Substitute("Dir $0 failed, marking for failure", dir->dir()); : uint16_t uuid_idx; : if (dd_manager_->FindUuidIndexByDataDir(dir, &uuid_idx)) { : // If the directory is registered as failed, ignore it. : if (dd_manager_->IsDataDirFailed( > hm, I dont think "fallthrough intended" is really necessary here. Usually s Done PS8, Line 74: : return; > not sure what this means Done PS8, Line 80: > maybe use a typedef for this? Done Line 81: // Must outlive the eio error manager. > std::move(cb)? Done Line 125 > nit: I think it's better to keep the specifics of what the callback _does_ Done. I agree the error_manager is meant to handle more than just disk failures. I was thinking of having something like (failure type => get_observers), (failure type => notify_cb), (failure type => do other stuff) might be warranted later on. Or just leave it completely open-ended (failure type => handle error). The question is: design that now or when we have more handleable errors? http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: PS8, Line 121: * > why is this taking a pointer to a callback instead of just taking a callbac Done Line 214: void SetTabletFailedCallback(Callback<void(std::set<std::string>)>* cb); > same Done http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 684: LOG(WARNING) << s.ToString(); > not quite following the purpose of this parameter. In the case that delete_ Done. Split out the above code to transition tablet state. Only that and replica->Shutdown() are needed from what I can tell. Line 804: MonoTime start(MonoTime::Now()); > is there a DCHECK we could add here that the replica is already marked fail Moving this logic out of this patch, but done. http://gerrit.cloudera.org:8080/#/c/7029/8/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 131: > can you update the doc as to why you would want to DeleteTablet without del Took your other suggestion of bring out the needed logic out into its own function. PS8, Line 190: // TODO(awong): rather than deleting the tablet, don't do anything to the : // tablet metadata and just store disk failure state instead. : void FailTabletReplicas(const std::set<std::string>& tablet_ids); : > the docs don't seem to match the name here. Is this shutting it down or try Done Line 294: > I think if you made the other classes take the callback by value instead of Done -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
