Hello Tidy Bot, Mike Percy, Kudu Jenkins, Adar Dembo, Todd Lipcon, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/8395 to look at the new patch set (#9). Change subject: error_manager: synchronize/serialize handling ...................................................................... error_manager: synchronize/serialize handling The FsErrorManager is helper infrastructure that other classes can use to help provide API contracts related to error handling. Here is an example error-handling contract provided to the TSTabletManager in a future patch, if a disk failure Status is returned during tablet operation, either: 1) the tablet server will crash and we can rely on Kudu's crash-consistency mechanisms for safety, or 2) any affected Tablet will transition to the 'kStopped' state via an error manager callback. Most components will simply pass the non-OK Status up the call chain. However, if a method expects an IO operation to succeed for correctness purposes, and it receives an error Status, then it should check that the Tablet is stopped. If so, it can be assumed that the error was handled. Otherwise, Kudu must rely on the crash-consistency mechanisms and crash. Given the above contract, The state of a tablet server post-disk-failure depends significantly on the completion of disk-failure-handling callbacks. Error-handling _must_ finish before anything is propagated back to the offending caller. To ensure correct completion of error-handling even in a concurrent setting, this patch extends the error manager with a mutex and a second error-handling callback type. This ensures that errors indirectly caused by disk failures can be handled by non-disk-specific handling, serializing failure-handling in the same fashion. As an example of where this is necessary, say a tablet has data in a single directory and hits a bad disk. That directory is immediately marked failed and handling starts to fail all tablets in the directory. Before, if the tablet were to create a new block before being failed, it would fail immediately, complaining that no directories are available, and would eventually fail a CHECK that translates roughly to: "Has error handling for this tablet completed?" By wrapping block creation with tablet-specific error handling and with serialized error-handling, this CHECK will pass. A previous revision accomplished this by using a single-threaded threadpool to trigger error-handling and ensuring completion of error-handling by waiting on the threadpool. I ultimately went with this implentation since it's a bit more straight-forward wrt when to make the different calls (i.e. "trigger disk error-handling vs trigger tablet error-handling", instead of "trigger error-handling vs wait for error-handling to finish"). This also has the benefit of being extendable for other kinds of errors in the future. Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e --- M src/kudu/fs/CMakeLists.txt M src/kudu/fs/data_dirs-test.cc A src/kudu/fs/error_manager-test.cc A src/kudu/fs/error_manager.cc M src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/tserver/tablet_server.cc M src/kudu/util/status.h 12 files changed, 378 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/95/8395/9 -- To view, visit http://gerrit.cloudera.org:8080/8395 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e Gerrit-Change-Number: 8395 Gerrit-PatchSet: 9 Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org>