[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421093#comment-16421093
 ] 

Todd Lipcon commented on KUDU-2395:
-----------------------------------

I symbolized the unsymbolized libc frames in the trace:

- all of the threads inside _nss_* are stuck in __pthread_mutex_lock except for 
one.
- that one is in '_nss_files_gethostbyname2_r' calling 'fgets_unlocked' (I 
guess reading /etc/hosts)

So, essentially, host name resolution is serialized behind a mutex inside 
libnss, and this becomes a large bottleneck when many threads need to create 
consensus proxies at the same time after a failure is detected.
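
One way to take pressure off the libnss lock is to avoid calling the resolver repeatedly for the same host. The following is a minimal sketch of that idea, not what Kudu does or a proposed patch: CachingResolver, ResolveFn and the TTL parameter are hypothetical names, and the actual lookup (for example a wrapper around getaddrinfo()) is passed in as a callback so that the caching logic stays independent of it.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Kudu: remembers the addresses returned
// for each host so that a burst of callers asking for the same host invokes
// the underlying resolver only once per TTL period.
class CachingResolver {
 public:
  using Addresses = std::vector<std::string>;
  using ResolveFn = std::function<Addresses(const std::string&)>;
  using Clock = std::chrono::steady_clock;

  CachingResolver(ResolveFn resolve, std::chrono::milliseconds ttl)
      : resolve_(std::move(resolve)), ttl_(ttl) {}

  Addresses Resolve(const std::string& host) {
    // Simplification: the lock is held across the lookup, so cache misses
    // are still serialized. Cache hits return without touching libnss.
    std::lock_guard<std::mutex> guard(lock_);
    const Clock::time_point now = Clock::now();
    auto it = cache_.find(host);
    if (it != cache_.end() && now < it->second.expires) {
      return it->second.addrs;
    }
    Addresses addrs = resolve_(host);
    // Only successful (non-empty) results are cached, so a failed lookup
    // is retried on the next call.
    if (!addrs.empty()) {
      cache_[host] = Entry{addrs, now + ttl_};
    }
    return addrs;
  }

 private:
  struct Entry {
    Addresses addrs;
    Clock::time_point expires;
  };

  ResolveFn resolve_;
  std::chrono::milliseconds ttl_;
  std::mutex lock_;
  std::unordered_map<std::string, Entry> cache_;
};
```

With a cache like this in front of the resolver, a few thousand election tasks targeting the same handful of peers would mostly hit the cache instead of queueing on the libnss mutex. The trade-off is staleness: an address change is not noticed until the TTL expires.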

The thing that seems to have precipitated all of this was a short process-wide 
blip:
{code}
W0330 16:29:58.896669  8975 net_util.cc:159] Time spent resolve address for vc1310.halxg.cloudera.com: real 1.003s      user 0.001s     sys 0.000s
W0330 16:29:58.899119  4840 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.967s user 0.001s     sys 0.000s
W0330 16:29:58.899160  4840 thread.cc:521] raft [worker] (thread pool) Time spent starting thread: real 0.967s  user 0.001s     sys 0.000s
W0330 16:29:58.899212  4839 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.983s user 0.000s     sys 0.000s
W0330 16:29:58.899235  4841 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.954s user 0.001s     sys 0.000s
W0330 16:29:58.899217  4842 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.973s user 0.000s     sys 0.000s
W0330 16:29:58.899281  4841 thread.cc:521] raft [worker] (thread pool) Time spent starting thread: real 0.954s  user 0.001s     sys 0.000s
{code}

This machine is el6 and does have THP enabled, so perhaps khugepaged stalled 
the process and caused the 1-second blip. But the resulting thread spike was 
mostly self-inflicted.
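
To confirm the THP setting on a host, the sysfs control file can be read directly. The sketch below assumes the usual file format, in which the active value is shown in brackets (for example "[always] madvise never"). The function names are hypothetical. As far as I know the file is /sys/kernel/mm/transparent_hugepage/enabled on mainline kernels and /sys/kernel/mm/redhat_transparent_hugepage/enabled on el6, so the path is left as a parameter.

```cpp
#include <fstream>
#include <string>

// Returns the bracketed (active) value from a THP setting string such as
// "[always] madvise never", or an empty string if no bracketed value exists.
std::string ActiveThpSetting(const std::string& contents) {
  const std::string::size_type open = contents.find('[');
  if (open == std::string::npos) return "";
  const std::string::size_type close = contents.find(']', open + 1);
  if (close == std::string::npos) return "";
  return contents.substr(open + 1, close - open - 1);
}

// Reads the first line of the given sysfs file and returns its active
// value. Returns an empty string if the file is missing or unreadable.
// The caller supplies the path because it differs between kernels.
std::string ReadThpSetting(const std::string& path) {
  std::ifstream in(path);
  std::string line;
  if (!in || !std::getline(in, line)) return "";
  return ActiveThpSetting(line);
}
```

A result of "always" from ReadThpSetting() would be consistent with the khugepaged theory above, though it does not prove that khugepaged caused the stall.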

> Thread spike with all threads blocked in libnss
> -----------------------------------------------
>
>                 Key: KUDU-2395
>                 URL: https://issues.apache.org/jira/browse/KUDU-2395
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tserver, util
>            Reporter: Todd Lipcon
>            Priority: Major
>
> I saw the thread count on a server under a load test spike from 280 threads 
> (fairly constant) to 3400 threads (briefly). I checked the diagnostics log 
> and found that there are several thousand threads in a stack like:
> {code}
> 0x7facce018606 _nss_files_gethostbyname2_r
>   0x345a703645 <unknown>
>   0x345a6d0b3b <unknown>
>   0x345a6d2d80 <unknown>
>      0x1c9366c kudu::(anonymous namespace)::GetAddrInfo()
>      0x1c95fbe kudu::HostPort::ResolveAddresses()
>       0xac4b78 kudu::consensus::(anonymous namespace)::CreateConsensusServiceProxyForHost()
>       0xac5058 kudu::consensus::RpcPeerProxyFactory::NewProxy()
>       0xb0b212 kudu::consensus::LeaderElection::LeaderElection()
>       0xafab80 kudu::consensus::RaftConsensus::StartElection()
>       0xafd20c kudu::consensus::RaftConsensus::ReportFailureDetectedTask()
>      0x1ccf4ed kudu::FunctionRunnable::Run()
> {code}



