Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8395 )
Change subject: error_manager: synchronize/serialize handling ...................................................................... Patch Set 5: (2 comments) http://gerrit.cloudera.org:8080/#/c/8395/5/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: http://gerrit.cloudera.org:8080/#/c/8395/5/src/kudu/fs/error_manager.h@96 PS5, Line 96: FsErrorManager() : I find the error handling semantics confusing and complicated and I think we can avoid so many call sites to the error handling. I wrote this out but it was too much for Gerrit, so please see this Gist for how I think this could work: https://gist.github.com/mpercy/eed419cc4bd2da23cd93ddf9ec51e5d9 The idea behind the above is that the lowest level possible reports the errors to the central error reporter, while we decide what to do at the highest levels. I have only addressed disk handling in that gist, but this generic approach also works for so-called "tablet" errors. Really I think they are fundamentally different: 1. actual disk failures - these are detected at a very low level when they are reported from POSIX calls. 2. invariant violations - this is corruption / mismatched checksums, as well as missing files, etc. These are not so much reported as actively checked for. I think the above are different enough that we probably need 2 different types of callbacks but I think it makes sense to invoke them very differently, where we actively invoke "tablet" callbacks when we do these kinds of sanity checks but block manager code never reports the disk errors because it's all taken care of in env_posix.cc LMK what you think. http://gerrit.cloudera.org:8080/#/c/8395/4/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: http://gerrit.cloudera.org:8080/#/c/8395/4/src/kudu/fs/error_manager.h@80 PS4, Line 80: > Ah, I see. Yeah, maybe putting it in util/status.h was the right call then I see. Personally I find it confusing for two reasons: we call it RETURN_NOT_OK_CALL() when maybe we really mean RETURN_NOT_OK_EVAL(). So semantically it's a little tricky. 2nd, there is a danger of someone doing something like: RETURN_NOT_OK_CALL(DoSomething(), [&] { my_obj->invoke_err_callback(); }) because it will compile and run but the callback will never get called. We will probably get a compiler warning, but that's it. I think it would be less surprising to require the 2nd argument to be callable and make the usage I just noted the required idiom. -- To view, visit http://gerrit.cloudera.org:8080/8395 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e Gerrit-Change-Number: 8395 Gerrit-PatchSet: 5 Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Tue, 14 Nov 2017 03:07:57 +0000 Gerrit-HasComments: Yes