[kudu-CR] error manager: synchronize/serialize handling

Mike Percy (Code Review) Mon, 13 Nov 2017 19:09:03 -0800

Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8395 )


Change subject: error_manager: synchronize/serialize handling
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8395/5/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

http://gerrit.cloudera.org:8080/#/c/8395/5/src/kudu/fs/error_manager.h@96
PS5, Line 96:   FsErrorManager() :
I find the error handling semantics confusing and complicated and I think we 
can avoid so many call sites to the error handling. I wrote this out but it was 
too much for Gerrit, so please see this Gist for how I think this could work: 
https://gist.github.com/mpercy/eed419cc4bd2da23cd93ddf9ec51e5d9

The idea behind the above is that the lowest level possible reports the errors 
to the central error reporter, while we decide what to do at the highest levels.

I have only addressed disk handling in that gist, but this generic approach 
also works for so-called "tablet" errors. Really I think they are fundamentally 
different:

1. actual disk failures - these are detected at a very low level when they are 
reported from POSIX calls.
2. invariant violations - this is corruption / mismatched checksums, as well as 
missing files, etc. These are not so much reported as actively checked for.

I think the above are different enough that we probably need 2 different types 
of callbacks but I think it makes sense to invoke them very differently, where 
we actively invoke "tablet" callbacks when we do these kinds of sanity checks 
but block manager code never reports the disk errors because it's all taken 
care of in env_posix.cc

LMK what you think.


http://gerrit.cloudera.org:8080/#/c/8395/4/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

http://gerrit.cloudera.org:8080/#/c/8395/4/src/kudu/fs/error_manager.h@80
PS4, Line 80:
> Ah, I see. Yeah, maybe putting it in util/status.h was the right call then
I see. Personally I find it confusing for two reasons: we call it 
RETURN_NOT_OK_CALL() when maybe we really mean RETURN_NOT_OK_EVAL(). So 
semantically it's a little tricky. 2nd, there is a danger of someone doing 
something like: RETURN_NOT_OK_CALL(DoSomething(), [&] { 
my_obj->invoke_err_callback(); }) because it will compile and run but the 
callback will never get called. We will probably get a compiler warning, but 
that's it. I think it would be less surprising to require the 2nd argument to 
be callable and make the usage I just noted the required idiom.



--
To view, visit http://gerrit.cloudera.org:8080/8395
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e
Gerrit-Change-Number: 8395
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Tue, 14 Nov 2017 03:07:57 +0000
Gerrit-HasComments: Yes

[kudu-CR] error manager: synchronize/serialize handling

Reply via email to