Adar Dembo has posted comments on this change. Change subject: WIP disk failure: handle disk failures in blocks ......................................................................
Patch Set 14: (26 comments) http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: Line 268: DataDir* FindDataDirByUuidIndex(uint16_t uuid_idx)const; Mistake? http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 56: RETURN_NOT_OK(s_); \ > Done Not done? Should be 'return s_'. Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \ > I would agree, but RETURN_NOT_OK_HANDLE and RETURN_NOT_OK_HANDLE_ERROR are Oh, yes, I see what you mean: HANDLE_DISK_FAILURE only considers actual disk failures, while RETURN_NOT_OK_HANDLE considers any non-OK status. In that case, could you revert the name back to RETURN_NOT_OK_HANDLE (or RETURN_NOT_OK_RUN_HANDLER)? I agree that RETURN_NOT_OK_HANDLE_DISK_FAILURE is an inappropriate name. http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: PS14, Line 48: if it results in a disk : // failure The handler is run on any failure, not just disk failures. http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: Line 238: virtual void HandleError(const Status& s) const; Shouldn't be virtual as it's not participating in the class hierarchy at all. Line 240: DataDir* mutable_data_dir() const { Just combine into HandleError(). Line 386: return !close.ok() ? close : sync; Maybe do a single HandleError() down here instead? Line 417: virtual void HandleError(const Status& s) const; Not virtual. Line 419: DataDir* mutable_data_dir() const { This is only called by HandleError(); how about moving its implementation into HandlError() itself? Line 482: return reader_->Size(sz); Can't this fail with an IO error too? Line 539: if (PREDICT_FALSE(IsDiskFailure(env_->SyncDir(s)))) { We're no longer returning early if one SyncDir() fails. I don't necessarily disagree with that change, just wanted to make sure you made it intentionally. Line 649: if (PREDICT_FALSE(IsDiskFailure(s))) { For this failure (and/or the next one), should we try a new data directory instead of returning the error? Line 671: if (PREDICT_FALSE(IsDiskFailure(s))) { Shouldn't this (and the above) use HANDLE_DISK_FAILURE? Line 700: return Status::OK(); Shouldn't we return an actual error though? If we return OK, the tablet flush code will remove this block from the list of deleted blocks and won't retry its deletion, orphaning it. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: Line 498: } > Since this is a static method, the handling wraps the Create() method. With What's missing? Open() is also a static method and you've managed to insert the failure handling into it. Or, alternatively, if that results in too many handling sites, perhaps move the handling so it wraps Open() too? Then they're at least consistent. http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: PS14, Line 682: // Handle any other errors, e.g. disk failures. : HandleError(read_status); Nit: move this under the "If we've made it here..." comment, since it applies to the HandleError() call too. Line 794: RETURN_NOT_OK(block_manager()->SyncContainer(*this)); RETURN_NOT_OK_HANDLE_ERROR? Or, alternatively, add failure handling within SyncContainer() itself. Line 1095: virtual void HandleError(const Status& s) const { Shouldn't be virtual. On second look, I'm not seeing where HandleError() or mutable_data_dir() are called. Line 1253: RETURN_NOT_OK_PREPEND(AppendMetadata(), "Unable to flush block during close"); This is wrong, look at the comment on L1234-1236. Line 1309: virtual void HandleError(const Status& s) const { Not virtual. PS14, Line 1309: virtual void HandleError(const Status& s) const { : container_->HandleError(s); : } : : DataDir* mutable_data_dir() { : return container_->mutable_data_dir(); : } Same; where is this used? Line 1517: bool is_on_ext4; Bad conflict resolution here. Line 1745: RETURN_NOT_OK_PREPEND(s, "Could not create new log block container at " + dir->dir()); Hmm, but we could try a new data directory right? http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/tablet/mvcc.cc File src/kudu/tablet/mvcc.cc: Line 93: CHECK_EQ(RESERVED, old_state) << "transaction with timestamp " << timestamp.ToString() Belongs in the patch dealing with transactions? http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 1094: LOG(WARNING) << Substitute("Tablet $0 is located on an unhealthy data dir.", tablet_id); Likewise, the changes to this file don't seem like they belong in this patch. http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/util/status.h File src/kudu/util/status.h: PS14, Line 466: As to not : // break ABI-compatability of the Status class, rather than changing the Status : // class, we use this function to determine disk failures. Adding a new member function to Status doesn't break ABI compatibility, so if you'd rather do that, go for it. -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 14 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
