Alexey Serbin has uploaded this change for review.
http://gerrit.cloudera.org:8080/16934
Change subject: [util] fix another lock contention in MaintenanceManager
......................................................................
[util] fix another lock contention in MaintenanceManager
I had a chance to look at stack traces in the diagnostic log files from
a Kudu cluster with a high data ingest rate. There were many thread
stack snapshots, and the pattern below appeared in every consecutive
snapshot for several minutes.
tids=[4016]
0x7f64f36b05e0 <unknown>
0xa116c6 kudu::tablet::BudgetedCompactionPolicy::RunApproximation()
0xa129c9 kudu::tablet::BudgetedCompactionPolicy::PickRowSets()
0x9c8d80 kudu::tablet::Tablet::UpdateCompactionStats()
0x9ec848 kudu::tablet::CompactRowSetsOp::UpdateStats()
0x1b3de5c kudu::MaintenanceManager::FindBestOp()
0x1b3f3c5 kudu::MaintenanceManager::RunSchedulerThread()
0x1b86014 kudu::Thread::SuperviseThread()
0x7f64f36a8e25 start_thread
0x7f64f176f34d __clone
tids=[48325,48324,48323]
0x7f64f36b05e0 <unknown>
0x7f64f36af42b __lll_lock_wait
0x7f64f36aadcb _L_lock_812
0x7f64f36aac98 __GI___pthread_mutex_lock
0x1b546fd kudu::Mutex::Acquire()
0x1b42913 kudu::MaintenanceManager::LaunchOp()
0x1b929cd kudu::FunctionRunnable::Run()
0x1b8fa87 kudu::ThreadPool::DispatchThread()
0x1b86014 kudu::Thread::SuperviseThread()
0x7f64f36a8e25 start_thread
0x7f64f176f34d __clone
It seems thread 4016 above had acquired the MaintenanceManager::lock_
mutex and was calculating the scores for compaction candidates. Three
other threads (48325, 48324, 48323) were waiting to acquire the same
mutex before returning from the MaintenanceManager::LaunchOp() method.
In effect, these three threads stayed blocked while the
MaintenanceManager::RunSchedulerThread() task was scheduling
maintenance operations.
To relieve the contention, I updated the code to use separate mutexes
for the op-specific condition variable and the scheduler's condition
variable. Now the op-specific condition variable uses
MaintenanceManager::running_instances_lock_ (which is also used to guard
access to the MaintenanceManager::running_instances_ container).
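As a rough illustration of the lock split described above, here is a
minimal standalone sketch using the C++ standard library. It is not the
Kudu code: SketchManager, OnOpStarted(), OnOpFinished(), ScheduleWith(),
WaitForAllOps() and the running_ops_ counter are hypothetical names, and
std::mutex/std::condition_variable stand in for Kudu's own primitives.
Only the two-mutex idea is taken from this change.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical, heavily simplified stand-in for the manager (not Kudu code).
class SketchManager {
 public:
  // Worker side: register a running op.
  void OnOpStarted() {
    std::lock_guard<std::mutex> l(running_instances_lock_);
    ++running_ops_;
  }

  // Worker side: an op finished. Only running_instances_lock_ is taken,
  // so this never waits behind a scheduler that is holding lock_.
  void OnOpFinished() {
    {
      std::lock_guard<std::mutex> l(running_instances_lock_);
      --running_ops_;
    }
    op_cond_.notify_all();
  }

  // Scheduler side: run a (possibly slow) scoring step under lock_.
  template <typename ScoreFn>
  void ScheduleWith(ScoreFn&& score_fn) {
    std::lock_guard<std::mutex> l(lock_);
    score_fn();
  }

  // Block until no ops are running; waits on the op-specific
  // condition variable, paired with running_instances_lock_.
  void WaitForAllOps() {
    std::unique_lock<std::mutex> l(running_instances_lock_);
    op_cond_.wait(l, [this] { return running_ops_ == 0; });
  }

  int running_ops() {
    std::lock_guard<std::mutex> l(running_instances_lock_);
    return running_ops_;
  }

 private:
  std::mutex lock_;                    // guards scheduler state
  std::mutex running_instances_lock_;  // guards running_ops_, pairs with op_cond_
  std::condition_variable op_cond_;
  int running_ops_ = 0;
};

// Returns true if a worker can finish while the scheduler holds lock_.
bool WorkerFinishesWhileSchedulerBusy() {
  SketchManager m;
  m.OnOpStarted();
  m.ScheduleWith([&m] {
    std::thread worker([&m] { m.OnOpFinished(); });
    worker.join();  // would deadlock if OnOpFinished() needed lock_
  });
  m.WaitForAllOps();
  return m.running_ops() == 0;
}
```

In this sketch the worker completes even though the scheduler thread is
still inside its scoring step, which is the situation the stacks above
show going wrong with a single shared mutex.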
This patch also fixes reporting on the duration of compaction
operations. Before this patch, the reported timings for compaction
operations could be inflated under high lock contention, especially in
cases like the one shown by the stacks captured above.
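One way such inflation can arise, sketched below under the assumption
that the extra time came from waiting on a contended mutex inside the
timed region: stop the clock before taking the lock. TimedPerform() and
MeasureUnderContention() are hypothetical helpers for illustration only
and do not reflect how maintenance_manager.cc measures durations.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Hypothetical helper (not Kudu code): time only the operation itself,
// then take the lock for bookkeeping.
template <typename PerformFn>
Millis TimedPerform(PerformFn&& perform, std::mutex& bookkeeping_lock) {
  const auto start = Clock::now();
  perform();
  const auto elapsed = Clock::now() - start;  // clock stops before the lock
  std::lock_guard<std::mutex> l(bookkeeping_lock);
  // ... shared bookkeeping would be updated here ...
  return std::chrono::duration_cast<Millis>(elapsed);
}

// Simulates contention: the lock is held for ~300 ms while a ~20 ms op runs.
Millis MeasureUnderContention() {
  std::mutex mu;
  Millis measured{0};
  mu.lock();
  std::thread worker([&] {
    measured = TimedPerform(
        [] { std::this_thread::sleep_for(Millis(20)); }, mu);
  });
  std::this_thread::sleep_for(Millis(300));
  mu.unlock();
  worker.join();
  return measured;
}
```

With the clock stopped before the lock is requested, the measured
duration tracks the operation itself and not the time spent queued
behind the lock holder.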
Change-Id: I63b12dd3641ef655f8fcbbad8d8ac515d874c0fb
---
M src/kudu/util/maintenance_manager.cc
M src/kudu/util/maintenance_manager.h
2 files changed, 143 insertions(+), 115 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/16934/1
--
To view, visit http://gerrit.cloudera.org:8080/16934
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I63b12dd3641ef655f8fcbbad8d8ac515d874c0fb
Gerrit-Change-Number: 16934
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin