Andrew Wong has posted comments on this change. Change subject: WIP disk failure: coordinate disk failure handling ......................................................................
Patch Set 9: (23 comments) http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: Line 141: virtual void HandleError(const Status& s, DataDir* dir) const = 0; > Why does HandleError() need to be part of the external API to WritableBlock Done http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 45: RETURN_NOT_OK(s_); \ > Wrong patch, but shouldn't this be just 'return s'? We've already establish Done Line 56: RETURN_NOT_OK(s_); \ > Same. Done Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \ > Since this is so similar to RETURN_NOT_OK_HANDLE would prefer if the macro I would agree, but RETURN_NOT_OK_HANDLE and RETURN_NOT_OK_HANDLE_ERROR are both non-specific to disk failures. Line 64: (err_handler); \ > Nit: indentation Done http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: Line 370: if (PREDICT_FALSE(IsDiskFailure(sync))) { > Shouldn't these be HANDLE_DISK_FAILURE calls instead? HandleError() should be sufficient (without the PREDICT_FALSE here) http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: Line 399: void HandleError(const Status& s, DataDir* dir) const; > This isn't an accessor; bring it up into the other section and doc it. Done Line 402: friend class LogWritableBlock; > Don't need? HandleError() is public after all. Done Line 498: Status LogBlockContainer::Create(LogBlockManager* block_manager, > What about disk failures during container creation? Since this is a static method, the handling wraps the Create() method. With the right input to Create(), this could also be handled in the Create() call itself to better align with the idea of only wrapping env calls. Line 604: // Open the existing metadata and data files for writing. > What about failures here? Done Line 648: RETURN_NOT_OK(block_manager()->env()->NewRandomAccessFile(metadata_path, &metadata_reader)); > What about failures here? Done Line 656: read_status = pb_reader.ReadNextPB(&record); > And here? Done Line 858: return metadata_file_->Append(pb); > Failures here? Done Line 881: return metadata_file_->Sync(); > What about this one? Done Line 886: Status LogBlockContainer::ReopenMetadataWriter() { > And failures here? Done Line 1090: DataDir* mutable_data_dir() { > Where is this used? It was used in calls to HandleError (probably should've been private!). Since HandleError() doesn't need a specified dir, I've taken it out. Line 1305: DataDir* mutable_data_dir() { > And this? Same as above. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/delta_tracker.cc File src/kudu/tablet/delta_tracker.cc: Line 696: Status s = FlushDMS(old_dms.get(), &dfr, flush_type); > Shouldn't we just RETURN_NOT_OK here and drop the CHECK? Isn't the expectat Done http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/metadata.proto File src/kudu/tablet/metadata.proto: Line 160: // The Tablet has been stopped after a failure. In this state, like FAILED, > Why do we need a new state for this? Why isn't FAILED sufficient? When running into a disk failure, there's eventually a call to TabletReplica::Shutdown(). The final state of this has indicate that the tablet is also shutdown (to ensure Shutdown() remains idempotent). http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/tablet_replica_mm_ops.cc File src/kudu/tablet/tablet_replica_mm_ops.cc: Line 138: tablet_replica_->tablet()->rowsets_flush_sem_.unlock(); > Can you use an RAII-style guard for the acquisition forof rowsets_flush_sem Hrm, I suppose it could pass rowsets_flush_sem_ ownership to some ScopedLock. Line 196: max_idx_to_replay_size); > Nit: indentation. Done http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tserver/tserver.proto File src/kudu/tserver/tserver.proto: Line 97: // The tablet is in a failed state. > TABLET_NOT_RUNNING is insufficiently descriptive? This is here to indicate to the leader which tablets that need to be replicated (TABLET_FAILED). TABLET_NOT_RUNNING shouldn't trigger replication. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/util/status.h File src/kudu/util/status.h: Line 433: inline bool IsDiskFailure(const Status& s) { > Would be nice to add a comment to each explaining why it's included. Done -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
