maintenance_manager: fix a deadlock on shutdown The shutdown sequence of the tablet server first shuts down the maintenance manager and then calls Unregister() on the registered ops.
This produced a potential hang on shutdown, since the 'Shutdown()' call could run at the same time that some maintenance ops were waiting on the thread_pool_ queue. Those waiting functions would be removed from the queue silently. We depend on the functions running to decrement the 'running_' count of the associated op, so when they were removed silently, the 'Unregister()' call could block forever waiting for the 'running_' count to go to 0. This caused a timeout of about 0.5% of runs of the new stop-tablet-itest 'TestShutdownWhileWriting' test case. With this fix, no runs time out. Change-Id: Icaf864299bfd43212bc9655f48128851b9c1d59b Reviewed-on: http://gerrit.cloudera.org:8080/8616 Tested-by: Kudu Jenkins Reviewed-by: Mike Percy <[email protected]> Project: http://git-wip-us.apache.org/repos/asf/kudu/repo Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/36a5e766 Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/36a5e766 Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/36a5e766 Branch: refs/heads/master Commit: 36a5e7669551d0f79c340eadb02ce8d51defb590 Parents: 63b53ad Author: Todd Lipcon <[email protected]> Authored: Mon Nov 20 23:05:40 2017 -0800 Committer: Todd Lipcon <[email protected]> Committed: Tue Nov 21 18:07:54 2017 +0000 ---------------------------------------------------------------------- src/kudu/util/maintenance_manager.cc | 5 +++++ 1 file changed, 5 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/kudu/blob/36a5e766/src/kudu/util/maintenance_manager.cc ---------------------------------------------------------------------- diff --git a/src/kudu/util/maintenance_manager.cc b/src/kudu/util/maintenance_manager.cc index 8984de8..50d792e 100644 --- a/src/kudu/util/maintenance_manager.cc +++ b/src/kudu/util/maintenance_manager.cc @@ -183,6 +183,11 @@ void MaintenanceManager::Shutdown() { if (monitor_thread_.get()) { CHECK_OK(ThreadJoiner(monitor_thread_.get()).Join()); monitor_thread_.reset(); + // Wait for all the running and queued tasks before shutting down. Otherwise, + // Shutdown() can remove a queued task silently. We count on eventually running the + // queued tasks to decrement their "running" count, which is incremented at the time + // they are enqueued. + thread_pool_->Wait(); thread_pool_->Shutdown(); } }
