Andrew Wong has posted comments on this change.

Change subject: disk failure: error-handling macros in blocks
......................................................................


Patch Set 17:

(26 comments)

http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 268:   DataDir* FindDataDirByUuidIndex(uint16_t uuid_idx) const;
> Mistake?
Done


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS14, Line 48: 
             :   return s
> The handler is run on any failure, not just disk failures.
Changed this to call on disk failures, which I found to actually be useful.


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 238:   void HandleError(const Status& s) const;
> Shouldn't be virtual as it's not participating in the class hierarchy at al
Done


Line 240:  private:
> Just combine into HandleError().
Done


Line 386: 
> Maybe do a single HandleError() down here instead?
I left this as two separate calls error-handling calls (e.g. if they have 
different errors), but I did move them down here.


Line 417:  private:
> Not virtual.
Done


Line 419:   FileBlockManager* block_manager_;
> This is only called by HandleError(); how about moving its implementation i
Done


Line 482:   return ReadV(offset, &results);
> Can't this fail with an IO error too?
Done


Line 539:   return Status::OK();
> We're no longer returning early if one SyncDir() fails. I don't necessarily
This was unintentional, thanks for checking. Here I think it'd be best to keep 
the earlier behavior but with the error handling callback.


Line 649:   if (s.ok()) {
> For this failure (and/or the next one), should we try a new data directory 
I'm not sure what that would buy us since at that point, we already know a 
slice of the tablet is effectively unusable and the tablet will be shut down.


Line 671:       
error_manager_->RunErrorNotificationCb(dd_manager()->FindDataDirByUuidIndex( \
> Shouldn't this (and the above) use HANDLE_DISK_FAILURE?
Done


Line 700: 
> Shouldn't we return an actual error though? If we return OK, the tablet flu
Right. Actually I'll wrap file_cache_.DeleteFile() with a macro, since it 
should be caught there anyway. Same with the above.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 498: 
> What's missing? Open() is also a static method and you've managed to insert
Fair point. A couple more macros should do it.


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS14, Line 682:   HandleError(read_status);
              :   return read_status;
> Nit: move this under the "If we've made it here..." comment, since it appli
Done


Line 794:   scoped_refptr<LogBlock> lb = block_manager()->AddLogBlock(
> RETURN_NOT_OK_HANDLE_ERROR?
Done. I put it in SyncContainer.


Line 1095:   }
> Shouldn't be virtual.
HandleError() is called any place with RETURN_NOT_OK_HANDLE_ERROR(). You're 
right that mutable_data_dir() isn't needed anymore.


Line 1253:       VLOG(3) << "Syncing block " << id();
> This is wrong, look at the comment on L1234-1236.
Ah I see, thanks for pointing that out. Done


Line 1309:   // The owning container. Must outlive this block.
> Not virtual.
Done


PS14, Line 1309:   // The owning container. Must outlive this block.
               :   LogBlockContainer* container_;
               : 
               :   // A reference to this block's metadata.
               :   scoped_refptr<internal::LogBlock> log_block_;
               : 
               :   /
> Same; where is this used?
Done


Line 1517:         if (untested_block_size) {
> Bad conflict resolution here.
Done


Line 1745: }
> Hmm, but we could try a new data directory right?
This still indicates that part of the tablet's data is probably lost due to 
disk failure, so, in the context of disk failures, I think it makes sense to 
error out here and shut down the tablet.


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/tablet/mvcc.cc
File src/kudu/tablet/mvcc.cc:

Line 93:   CHECK_EQ(old_state, RESERVED) << "transaction with timestamp " << 
timestamp.ToString()
> Belongs in the patch dealing with transactions?
Done


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1094
> Likewise, the changes to this file don't seem like they belong in this patc
Done


http://gerrit.cloudera.org:8080/#/c/7030/15/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 602:       return Status::NotFound("Tablet not found", tablet_id);
> error: no viable conversion from 'scoped_refptr<kudu::tablet::TabletReplica
Done


Line 607:       *error_code = TabletServerErrorPB::TABLET_NOT_RUNNING;
> error: use of undeclared identifier 'reason' [clang-diagnostic-error]
Done


http://gerrit.cloudera.org:8080/#/c/7030/14/src/kudu/util/status.h
File src/kudu/util/status.h:

PS14, Line 466: 
              : 
              : inline Status& Status::operator=(Status&& s) {
> Adding a new member function to Status doesn't break ABI compatibility, so 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to