David Ribeiro Alves has posted comments on this change. Change subject: disk failure: coordinate fs error handling ......................................................................
Patch Set 3: (4 comments) think that this patch (no tests) and https://gerrit.cloudera.org/#/c/7030/3 should be merged and have simple unit tests (you can add your large itest separately) I think the interface to handle EIOs is weird. If I understood correctly the purpose is to make sure that the method exists for the EIO handling macro, but the macro doesn't need it to be an interface, i.e. you can keep the method and not have an interface and the macro will work all the same. so basically you have three levels of entities, the "block" files (which can get EIOs from Env) the block managers (and additionally the block containers for the log block manager) and the fs manager. Here's a rough sketch of an alternative idea: Have the multiple file*block impls have two methods (that don't need to be in an interface since they are not used externally) DataDir* get_data_dir(); void HandleEIO(DataDir* data_dir); The macro will call the second method with the result of the the first as argument. Then, instead of all being aware the error_managed, each passes the error to the layer above, i.e. the file*block* passes the error to the block_manager which passes the error to the fs_manager, which finally passes the error to the error manager. I might be missing some details that make this a non-starter, please let me know. It would also be good to pass most (if not all stuff) at ctor time and not have branching testing whether certain things are set. http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: PS3, Line 26: #include <set> include ordering http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: PS3, Line 27: EioHandleableImpl weird to be calling an interface "impl". also not sure about the interface itself, who uses this "contract" (and what is the contract, contracts are supposed to be externally facing things whereas this is used internally to where the EIOs happen)? http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: PS3, Line 685: LOG(WARNING) << Substitute("Block $0 is on a disk that has failed: $1", : block_id.ToString(), : dd_manager_.FindDataDirByUuidIndex(uuid_idx)->dir()); isn't this gonna spam the log if we delete a whole tablet? http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/file_block_manager.h File src/kudu/fs/file_block_manager.h: PS3, Line 77: FsErrorManager* error_manager = nullptr why allow this to be null? is that because of tests? -- To view, visit http://gerrit.cloudera.org:8080/7029 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
