Andrew Wong has posted comments on this change. Change subject: disk failure: coordinate error handling ......................................................................
Patch Set 16: (22 comments) http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: Line 199: > Since this is a raw pointer, you should say something about lifespans here. Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 32: typedef Callback<void(const std::string&)> ErrorNotificationCb; > Hmm, why must this be defined here and in fs_manager.h? Surely we could def Done, keeping it here and putting error_manager.h in fs_manager.h. Line 36: // multiple layers, many of which we prefer to keep separate. The FsErrorManager > These macros probably don't belong in this patch; I don't see them used at Done PS15, Line 62: > What does "blocks" mean here? If this is ReadableBlock/WritableBlock, there Done Line 66: } // namespace kudu > Nit: separate on each line. Done Line 82 > This is a pretty meaty method; why not move it to a .cc file? Moved all the logic out of the error manager. Line 100 > Should this be a CHECK/DCHECK? Under what circumstances would you expect th You're right that this should never happen. Done PS15, Line 106: : : > This is too much cross-layer mingling. Reword it so it's scoped to only how Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: Line 388: FileReadableBlock(const FileBlockManager* block_manager, BlockId block_id, > Why did you have to change this? And below? For a future patch, it was needed to set the DataDirManager in the FsErrorManager. This isn't a requirement anymore so I'll revert it. Line 544: "file_block_manager", > I don't particularly care for this. Why not hoist the DataDirManager out of I've restructured this a bit so the error manager doesn't need to know about any external state other than the callbacks. I do intend on moving the DataDirManager out of the BlockManager (and also likely renaming it to DirectoryManager or something), but in another patch. http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/file_block_manager.h File src/kudu/fs/file_block_manager.h: Line 106 > FWIW, this is a style that I employ, reminiscent of Javadoc, where the logi I see, I've seen this style around, just not with such short sections. Reverting Line 98: > Declare this virtual in BlockManager and override it here. Same for log_blo Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: Line 49: > I've only finished reviewing this patch (and not the rest) so I don't even The key, I think, was pushing all of the logic into the clearly-defined layers (e.g. fail the tablets in the TSTabletManager instead of here). Doing so also opens it up for a bit more flexibility (using a string instead of set<string> as an input to this cb). Line 49: > did you consider a design where the callback just notifies the failure by U Done Line 124: // This callback will be run when block operations run into disk failures. > Should also explain what this means. When will this callback be invoked? Wh Done PS15, Line 125: e input to the callback > think this would be better SetErrorNotificationCallback to match the type Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/fs/log_block_manager.h File src/kudu/fs/log_block_manager.h: Line 365: > Add a little comment here. Done http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS15, Line 164: // Search for tablets in the metadata dir. : vector<string> tablet_ids; > it seems a bit messy to have this being "wired" from the constructor. usual Done, moved to the wiring to TabletServer Start()/Shutdown() PS15, Line 175: int loaded_count = 0; : for (const string& tablet_id : tablet_ids) { > same, seems a little not-quite-right here Done Line 611: TabletDataState data_state = replica->tablet_metadata()->tablet_data_state(); > Did you intend for this decomposition to be in this patch? Since there's st Hrm, yeah originally there was a usage that was very specific to disk failure handling that's in a separate patch. Will revert. http://gerrit.cloudera.org:8080/#/c/7029/15/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: PS15, Line 132: scoped_refptr<tablet::TabletReplica>* replica) const; : : // Same as LookupTablet but doesn't acquired the shared lock. : bool LookupTabletUnlocked(const std::string& tablet_id, : scoped_refptr<tablet::TabletReplica>* replica) const; : : virtual Status GetTabletReplica(const std::string& tablet_id, > should this be private? Done PS15, Line 197: > the prototype indicates it operates on a bunch of replicas at once Restructured the calls. -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
