Adar Dembo has posted comments on this change.

Change subject: WIP disk failure: handle disk failures in blocks
......................................................................


Patch Set 14:

(26 comments)

http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 268:   DataDir* FindDataDirByUuidIndex(uint16_t uuid_idx)const;
Mistake?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 56:   RETURN_NOT_OK(s_); \
> Done
Not done? Should be 'return s_'.


Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \
> I would agree, but RETURN_NOT_OK_HANDLE and RETURN_NOT_OK_HANDLE_ERROR are 
Oh, yes, I see what you mean: HANDLE_DISK_FAILURE only considers actual disk 
failures, while RETURN_NOT_OK_HANDLE considers any non-OK status.

In that case, could you revert the name back to RETURN_NOT_OK_HANDLE (or 
RETURN_NOT_OK_RUN_HANDLER)? I agree that RETURN_NOT_OK_HANDLE_DISK_FAILURE is 
an inappropriate name.


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS14, Line 48: if it results in a disk
             : // failure
The handler is run on any failure, not just disk failures.


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 238:   virtual void HandleError(const Status& s) const;
Shouldn't be virtual as it's not participating in the class hierarchy at all.


Line 240:   DataDir* mutable_data_dir() const {
Just combine into HandleError().


Line 386:   return !close.ok() ? close : sync;
Maybe do a single HandleError() down here instead?


Line 417:   virtual void HandleError(const Status& s) const;
Not virtual.


Line 419:   DataDir* mutable_data_dir() const {
This is only called by HandleError(); how about moving its implementation into 
HandlError() itself?


Line 482:   return reader_->Size(sz);
Can't this fail with an IO error too?


Line 539:       if (PREDICT_FALSE(IsDiskFailure(env_->SyncDir(s)))) {
We're no longer returning early if one SyncDir() fails. I don't necessarily 
disagree with that change, just wanted to make sure you made it intentionally.


Line 649:     if (PREDICT_FALSE(IsDiskFailure(s))) {
For this failure (and/or the next one), should we try a new data directory 
instead of returning the error?


Line 671:   if (PREDICT_FALSE(IsDiskFailure(s))) {
Shouldn't this (and the above) use HANDLE_DISK_FAILURE?


Line 700:     return Status::OK();
Shouldn't we return an actual error though? If we return OK, the tablet flush 
code will remove this block from the list of deleted blocks and won't retry its 
deletion, orphaning it.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 498: }
> Since this is a static method, the handling wraps the Create() method. With
What's missing? Open() is also a static method and you've managed to insert the 
failure handling into it.

Or, alternatively, if that results in too many handling sites, perhaps move the 
handling so it wraps Open() too? Then they're at least consistent.


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS14, Line 682:   // Handle any other errors, e.g. disk failures.
              :   HandleError(read_status);
Nit: move this under the "If we've made it here..." comment, since it applies 
to the HandleError() call too.


Line 794:   RETURN_NOT_OK(block_manager()->SyncContainer(*this));
RETURN_NOT_OK_HANDLE_ERROR?

Or, alternatively, add failure handling within SyncContainer() itself.


Line 1095:   virtual void HandleError(const Status& s) const {
Shouldn't be virtual.

On second look, I'm not seeing where HandleError() or mutable_data_dir() are 
called.


Line 1253:       RETURN_NOT_OK_PREPEND(AppendMetadata(), "Unable to flush block 
during close");
This is wrong, look at the comment on L1234-1236.


Line 1309:   virtual void HandleError(const Status& s) const {
Not virtual.


PS14, Line 1309:   virtual void HandleError(const Status& s) const {
               :     container_->HandleError(s);
               :   }
               : 
               :   DataDir* mutable_data_dir() {
               :     return container_->mutable_data_dir();
               :   }
Same; where is this used?


Line 1517:       bool is_on_ext4;
Bad conflict resolution here.


Line 1745:   RETURN_NOT_OK_PREPEND(s, "Could not create new log block container 
at " + dir->dir());
Hmm, but we could try a new data directory right?


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/tablet/mvcc.cc
File src/kudu/tablet/mvcc.cc:

Line 93:   CHECK_EQ(RESERVED, old_state) << "transaction with timestamp " << 
timestamp.ToString()
Belongs in the patch dealing with transactions?


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1094:     LOG(WARNING) << Substitute("Tablet $0 is located on an unhealthy 
data dir.", tablet_id);
Likewise, the changes to this file don't seem like they belong in this patch.


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/util/status.h
File src/kudu/util/status.h:

PS14, Line 466: As to not
              : // break ABI-compatability of the Status class, rather than 
changing the Status
              : // class, we use this function to determine disk failures.
Adding a new member function to Status doesn't break ABI compatibility, so if 
you'd rather do that, go for it.


-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to