maintenance_manager: fix a deadlock on shutdown

The shutdown sequence of the tablet server first shuts down the maintenance
manager and then calls Unregister() on the registered ops.

This produced a potential hang on shutdown, since the 'Shutdown()' call could
run at the same time that some maintenance ops were waiting on the thread_pool_
queue. Those waiting functions would be removed from the queue silently. We
depend on the functions running to decrement the 'running_' count of the 
associated
op, so when they were removed silently, the 'Unregister()' call could block 
forever
waiting for the 'running_' count to go to 0.

This caused a timeout of about 0.5% of runs of the new stop-tablet-itest
'TestShutdownWhileWriting' test case. With this fix, no runs time out.

Change-Id: Icaf864299bfd43212bc9655f48128851b9c1d59b
Reviewed-on: http://gerrit.cloudera.org:8080/8616
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/36a5e766
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/36a5e766
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/36a5e766

Branch: refs/heads/master
Commit: 36a5e7669551d0f79c340eadb02ce8d51defb590
Parents: 63b53ad
Author: Todd Lipcon <[email protected]>
Authored: Mon Nov 20 23:05:40 2017 -0800
Committer: Todd Lipcon <[email protected]>
Committed: Tue Nov 21 18:07:54 2017 +0000

----------------------------------------------------------------------
 src/kudu/util/maintenance_manager.cc | 5 +++++
 1 file changed, 5 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/36a5e766/src/kudu/util/maintenance_manager.cc
----------------------------------------------------------------------
diff --git a/src/kudu/util/maintenance_manager.cc 
b/src/kudu/util/maintenance_manager.cc
index 8984de8..50d792e 100644
--- a/src/kudu/util/maintenance_manager.cc
+++ b/src/kudu/util/maintenance_manager.cc
@@ -183,6 +183,11 @@ void MaintenanceManager::Shutdown() {
   if (monitor_thread_.get()) {
     CHECK_OK(ThreadJoiner(monitor_thread_.get()).Join());
     monitor_thread_.reset();
+    // Wait for all the running and queued tasks before shutting down. 
Otherwise,
+    // Shutdown() can remove a queued task silently. We count on eventually 
running the
+    // queued tasks to decrement their "running" count, which is incremented 
at the time
+    // they are enqueued.
+    thread_pool_->Wait();
     thread_pool_->Shutdown();
   }
 }

Reply via email to