[kudu-CR] WIP disk failure: coordinate disk failure handling

Andrew Wong (Code Review) Fri, 07 Jul 2017 11:33:07 -0700

Andrew Wong has posted comments on this change.

Change subject: WIP disk failure: coordinate disk failure handling
......................................................................



Patch Set 9:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 141:   virtual void HandleError(const Status& s, DataDir* dir) const = 0;
> Why does HandleError() need to be part of the external API to WritableBlock
Done


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 45:   RETURN_NOT_OK(s_); \
> Wrong patch, but shouldn't this be just 'return s'? We've already establish
Done


Line 56:   RETURN_NOT_OK(s_); \
> Same.
Done


Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \
> Since this is so similar to RETURN_NOT_OK_HANDLE would prefer if the macro 
I would agree, but RETURN_NOT_OK_HANDLE and RETURN_NOT_OK_HANDLE_ERROR are both 
non-specific to disk failures.


Line 64:       (err_handler); \
> Nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 370:     if (PREDICT_FALSE(IsDiskFailure(sync))) {
> Shouldn't these be HANDLE_DISK_FAILURE calls instead?
HandleError() should be sufficient (without the PREDICT_FALSE here)


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 399:   void HandleError(const Status& s, DataDir* dir) const;
> This isn't an accessor; bring it up into the other section and doc it.
Done


Line 402:   friend class LogWritableBlock;
> Don't need? HandleError() is public after all.
Done


Line 498: Status LogBlockContainer::Create(LogBlockManager* block_manager,
> What about disk failures during container creation?
Since this is a static method, the handling wraps the Create() method. With the 
right input to Create(), this could also be handled in the Create() call itself 
to better align with the idea of only wrapping env calls.


Line 604:   // Open the existing metadata and data files for writing.
> What about failures here?
Done


Line 648:   
RETURN_NOT_OK(block_manager()->env()->NewRandomAccessFile(metadata_path, 
&metadata_reader));
> What about failures here?
Done


Line 656:     read_status = pb_reader.ReadNextPB(&record);
> And here?
Done


Line 858:   return metadata_file_->Append(pb);
> Failures here?
Done


Line 881:     return metadata_file_->Sync();
> What about this one?
Done


Line 886: Status LogBlockContainer::ReopenMetadataWriter() {
> And failures here?
Done


Line 1090:   DataDir* mutable_data_dir() {
> Where is this used?
It was used in calls to HandleError (probably should've been private!). Since 
HandleError() doesn't need a specified dir, I've taken it out.


Line 1305:   DataDir* mutable_data_dir() {
> And this?
Same as above.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/delta_tracker.cc
File src/kudu/tablet/delta_tracker.cc:

Line 696:   Status s = FlushDMS(old_dms.get(), &dfr, flush_type);
> Shouldn't we just RETURN_NOT_OK here and drop the CHECK? Isn't the expectat
Done


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/metadata.proto
File src/kudu/tablet/metadata.proto:

Line 160:   // The Tablet has been stopped after a failure. In this state, like 
FAILED,
> Why do we need a new state for this? Why isn't FAILED sufficient?
When running into a disk failure, there's eventually a call to 
TabletReplica::Shutdown(). The final state of this has indicate that the tablet 
is also shutdown (to ensure Shutdown() remains idempotent).


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/tablet_replica_mm_ops.cc
File src/kudu/tablet/tablet_replica_mm_ops.cc:

Line 138:     tablet_replica_->tablet()->rowsets_flush_sem_.unlock();
> Can you use an RAII-style guard for the acquisition forof rowsets_flush_sem
Hrm, I suppose it could pass rowsets_flush_sem_ ownership to some ScopedLock.


Line 196:                             max_idx_to_replay_size);
> Nit: indentation.
Done


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tserver/tserver.proto
File src/kudu/tserver/tserver.proto:

Line 97:     // The tablet is in a failed state.
> TABLET_NOT_RUNNING is insufficiently descriptive?
This is here to indicate to the leader which tablets that need to be replicated 
(TABLET_FAILED). TABLET_NOT_RUNNING shouldn't trigger replication.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/util/status.h
File src/kudu/util/status.h:

Line 433: inline bool IsDiskFailure(const Status& s) {
> Would be nice to add a comment to each explaining why it's included.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[kudu-CR] WIP disk failure: coordinate disk failure handling

Reply via email to