[ 
https://issues.apache.org/jira/browse/KUDU-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139506#comment-17139506
 ] 

Todd Lipcon commented on KUDU-3149:
-----------------------------------

In your thread dump it looks like Thread 4 is itself blocked: it is in 
pthread_mutex_lock() under Tablet::UpdateCompactionStats(), waiting on 
another lock. Who's holding that lock?

> Lock contention between registering ops and computing maintenance op stats
> --------------------------------------------------------------------------
>
>                 Key: KUDU-3149
>                 URL: https://issues.apache.org/jira/browse/KUDU-3149
>             Project: Kudu
>          Issue Type: Bug
>          Components: perf, tserver
>            Reporter: Andrew Wong
>            Priority: Major
>
> We saw many tablets bootstrapping extremely slowly, and many others that 
> appeared to be stuck bootstrapping but did not show up as such on the 
> {{/tablets}} page, i.e. we could only see INITIALIZED and RUNNING tablets, 
> no BOOTSTRAPPING.
> Upon digging into the stacks, we saw many threads waiting in:
> {code}
> TID 46583(tablet-open [wo):
>     @     0x7f1dd57147e0  (unknown)
>     @     0x7f1dd5713332  (unknown)
>     @     0x7f1dd570e5d8  (unknown)
>     @     0x7f1dd570e4a7  (unknown)
>     @          0x23b4058  kudu::Mutex::Acquire()
>     @          0x23980ff  kudu::MaintenanceManager::RegisterOp()
>     @           0xb85374  kudu::tablet::TabletReplica::RegisterMaintenanceOps()
>     @           0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
>     @          0x23f994c  kudu::ThreadPool::DispatchThread()
>     @          0x23f3f8b  kudu::Thread::SuperviseThread()
>     @     0x7f1dd570caa1  (unknown)
>     @     0x7f1dd3b18bcd  (unknown)
> {code}
> and upon further inspection, the lock they are waiting on is held by the 
> maintenance manager (MM) scheduler thread here:
> {code}
> Thread 4 (Thread 0x7f1d7d358700 (LWP 46999)):
> #0  0x00007f1dd5713334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007f1dd570e5d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x00007f1dd570e4a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x0000000000b51f29 in kudu::tablet::Tablet::UpdateCompactionStats(kudu::MaintenanceOpStats*) ()
> #4  0x0000000000b7f435 in kudu::tablet::CompactRowSetsOp::UpdateStats(kudu::MaintenanceOpStats*) ()
> #5  0x00000000023956e4 in kudu::MaintenanceManager::FindBestOp() ()
> #6  0x0000000002396af9 in kudu::MaintenanceManager::FindAndLaunchOp(std::unique_lock<kudu::Mutex>*) ()
> #7  0x0000000002397858 in kudu::MaintenanceManager::RunSchedulerThread() ()
> {code}
> It seems like the scheduler thread holds the maintenance manager's 
> {{lock_}} member for the entire time it is computing stats, and that 
> contends with the registration of maintenance ops. The scheduler thread is 
> thus effectively blocking the registration of many tablet replicas' ops, 
> which in turn blocks further bootstrapping.
> A couple things come to mind:
> - We could probably take a snapshot of the ops under the lock and release 
> {{lock_}} before finding the best op to run.
> - Additionally, we may want to consider disabling compactions entirely until 
> the initial set of tablets finishes bootstrapping.
> It's worth noting that it isn't the act of compacting that is contending 
> here, but rather the computation of the stats.
> As a workaround, we used the {{set_flag}} tool to disable compactions on the 
> node and noted significantly faster bootstrapping.
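
The first suggestion in the description (snapshot the ops under the lock, 
then compute stats with the lock released) could look roughly like the 
sketch below. This is a minimal illustration, not Kudu's actual code: 
Op, FixedScoreOp, Manager, ComputeScore() and num_ops() are hypothetical 
names, and std::mutex and std::shared_ptr stand in for whatever locking 
and ownership the real MaintenanceManager uses.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a maintenance op (not Kudu's real interface).
struct Op {
  virtual ~Op() = default;
  // Assumed to be potentially slow and to take other (tablet-level) locks.
  virtual double ComputeScore() = 0;
};

// Test-friendly op: returns a fixed score and optionally runs a hook while
// it is being scored, to simulate concurrent activity during stats updates.
struct FixedScoreOp : Op {
  explicit FixedScoreOp(double score, std::function<void()> hook = nullptr)
      : score_(score), hook_(std::move(hook)) {}
  double ComputeScore() override {
    if (hook_) hook_();
    return score_;
  }
  double score_;
  std::function<void()> hook_;
};

class Manager {
 public:
  // Registration only needs the lock long enough to append to the list.
  void RegisterOp(std::shared_ptr<Op> op) {
    std::lock_guard<std::mutex> l(lock_);
    ops_.push_back(std::move(op));
  }

  // Copies the op list under the lock, then scores the copy unlocked, so a
  // slow ComputeScore() no longer blocks callers of RegisterOp().
  std::shared_ptr<Op> FindBestOp() {
    std::vector<std::shared_ptr<Op>> snapshot;
    {
      std::lock_guard<std::mutex> l(lock_);
      snapshot = ops_;
    }  // lock_ is released here, before any stats are computed.

    std::shared_ptr<Op> best;
    double best_score = 0;
    for (const auto& op : snapshot) {
      const double score = op->ComputeScore();
      if (!best || score > best_score) {
        best = op;
        best_score = score;
      }
    }
    return best;
  }

  std::size_t num_ops() const {
    std::lock_guard<std::mutex> l(lock_);
    return ops_.size();
  }

 private:
  mutable std::mutex lock_;
  std::vector<std::shared_ptr<Op>> ops_;
};
```

The trade-off is that ops registered after the snapshot is taken are not 
considered until the next scheduling pass, and an op may be unregistered 
while the snapshot still refers to it. The sketch sidesteps the latter with 
shared ownership; a real change would need to handle unregistration 
explicitly.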


