David Ribeiro Alves has posted comments on this change.

Change subject: disk failure: coordinate fs error handling
......................................................................


Patch Set 3:

(4 comments)

think that this patch (no tests) and https://gerrit.cloudera.org/#/c/7030/3 
should be merged and have simple unit tests (you can add your large itest 
separately)

I think the interface to handle EIOs is weird. If I understood correctly the 
purpose is to make sure that the method exists for the EIO handling macro, but 
the macro doesn't need it to be an interface, i.e. you can keep the method and 
not have an interface and the macro will work all the same.

so basically you have three levels of entities, the "block" files (which can 
get EIOs from Env) the block managers (and additionally the block containers 
for the log block manager) and the fs manager.

Here's a rough sketch of an alternative idea:
Have the multiple file*block impls have two methods (that don't need to be in 
an interface since they are not used externally)

DataDir* get_data_dir();
void HandleEIO(DataDir* data_dir);

The macro will call the second method with the result of the the first as 
argument.

Then, instead of all being aware the error_managed, each passes the error to 
the layer above, i.e. the file*block* passes the error to the block_manager 
which passes the error to the fs_manager, which finally passes the error to the 
error manager.

I might be missing some details that make this a non-starter, please let me 
know.

It would also be good to pass most (if not all stuff) at ctor time and not have 
branching testing whether certain things are set.

http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

PS3, Line 26: #include <set>
include ordering


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

PS3, Line 27: EioHandleableImpl
weird to be calling an interface "impl". also not sure about the interface 
itself, who uses this "contract" (and what is the contract, contracts are 
supposed to be externally facing things whereas this is used internally to 
where the EIOs happen)?


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS3, Line 685: LOG(WARNING) << Substitute("Block $0 is on a disk that has 
failed: $1",
             :                                block_id.ToString(),
             :                                
dd_manager_.FindDataDirByUuidIndex(uuid_idx)->dir());
isn't this gonna spam the log if we delete a whole tablet?


http://gerrit.cloudera.org:8080/#/c/7029/3/src/kudu/fs/file_block_manager.h
File src/kudu/fs/file_block_manager.h:

PS3, Line 77: FsErrorManager* error_manager = nullptr
why allow this to be null? is that because of tests?


-- 
To view, visit http://gerrit.cloudera.org:8080/7029
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3169deada702c527b70fabf9d2223364f9a9ea6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to