Andrew Wong has posted comments on this change. Change subject: disk failure: error-handling macros in blocks ......................................................................
Patch Set 17: (26 comments) http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: Line 268: DataDir* FindDataDirByUuidIndex(uint16_t uuid_idx) const; > Mistake? Done http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: PS14, Line 48: : return s > The handler is run on any failure, not just disk failures. Changed this to call on disk failures, which I found to actually be useful. http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: Line 238: void HandleError(const Status& s) const; > Shouldn't be virtual as it's not participating in the class hierarchy at al Done Line 240: private: > Just combine into HandleError(). Done Line 386: > Maybe do a single HandleError() down here instead? I left this as two separate calls error-handling calls (e.g. if they have different errors), but I did move them down here. Line 417: private: > Not virtual. Done Line 419: FileBlockManager* block_manager_; > This is only called by HandleError(); how about moving its implementation i Done Line 482: return ReadV(offset, &results); > Can't this fail with an IO error too? Done Line 539: return Status::OK(); > We're no longer returning early if one SyncDir() fails. I don't necessarily This was unintentional, thanks for checking. Here I think it'd be best to keep the earlier behavior but with the error handling callback. Line 649: if (s.ok()) { > For this failure (and/or the next one), should we try a new data directory I'm not sure what that would buy us since at that point, we already know a slice of the tablet is effectively unusable and the tablet will be shut down. Line 671: error_manager_->RunErrorNotificationCb(dd_manager()->FindDataDirByUuidIndex( \ > Shouldn't this (and the above) use HANDLE_DISK_FAILURE? Done Line 700: > Shouldn't we return an actual error though? If we return OK, the tablet flu Right. Actually I'll wrap file_cache_.DeleteFile() with a macro, since it should be caught there anyway. Same with the above. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: Line 498: > What's missing? Open() is also a static method and you've managed to insert Fair point. A couple more macros should do it. http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: PS14, Line 682: HandleError(read_status); : return read_status; > Nit: move this under the "If we've made it here..." comment, since it appli Done Line 794: scoped_refptr<LogBlock> lb = block_manager()->AddLogBlock( > RETURN_NOT_OK_HANDLE_ERROR? Done. I put it in SyncContainer. Line 1095: } > Shouldn't be virtual. HandleError() is called any place with RETURN_NOT_OK_HANDLE_ERROR(). You're right that mutable_data_dir() isn't needed anymore. Line 1253: VLOG(3) << "Syncing block " << id(); > This is wrong, look at the comment on L1234-1236. Ah I see, thanks for pointing that out. Done Line 1309: // The owning container. Must outlive this block. > Not virtual. Done PS14, Line 1309: // The owning container. Must outlive this block. : LogBlockContainer* container_; : : // A reference to this block's metadata. : scoped_refptr<internal::LogBlock> log_block_; : : / > Same; where is this used? Done Line 1517: if (untested_block_size) { > Bad conflict resolution here. Done Line 1745: } > Hmm, but we could try a new data directory right? This still indicates that part of the tablet's data is probably lost due to disk failure, so, in the context of disk failures, I think it makes sense to error out here and shut down the tablet. http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/tablet/mvcc.cc File src/kudu/tablet/mvcc.cc: Line 93: CHECK_EQ(old_state, RESERVED) << "transaction with timestamp " << timestamp.ToString() > Belongs in the patch dealing with transactions? Done http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 1094 > Likewise, the changes to this file don't seem like they belong in this patc Done http://gerrit.cloudera.org:8080/#/c/7030/15/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 602: return Status::NotFound("Tablet not found", tablet_id); > error: no viable conversion from 'scoped_refptr<kudu::tablet::TabletReplica Done Line 607: *error_code = TabletServerErrorPB::TABLET_NOT_RUNNING; > error: use of undeclared identifier 'reason' [clang-diagnostic-error] Done http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/util/status.h File src/kudu/util/status.h: PS14, Line 466: : : inline Status& Status::operator=(Status&& s) { > Adding a new member function to Status doesn't break ABI compatibility, so Done -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
