[ 
https://issues.apache.org/jira/browse/KUDU-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750931#comment-15750931
 ] 

Todd Lipcon commented on KUDU-1564:
-----------------------------------

This should be fixed by KUDU-699.

> Deadlock on raft notification ThreadPool
> ----------------------------------------
>
>                 Key: KUDU-1564
>                 URL: https://issues.apache.org/jira/browse/KUDU-1564
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.10.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>         Attachments: stacks.txt
>
>
> In a stress test on a cluster, one of the tablet servers got stuck in a
> deadlock. It appears that:
> - The Raft notification ThreadPool for a tablet has a maximum of 24 threads
> (corresponding to the number of cores).
> - One of the threads is in:
> {code}
> #1  0x00000000019019b2 in kudu::Semaphore::Acquire() ()
> #2  0x0000000000985159 in kudu::consensus::Peer::Close() ()
> #3  0x000000000099d909 in kudu::consensus::PeerManager::Close() ()
> #4  0x00000000009684bd in kudu::consensus::RaftConsensus::RefreshConsensusQueueAndPeersUnlocked() ()
> #5  0x000000000096eced in kudu::consensus::RaftConsensus::ReplicateConfigChangeUnlocked(kudu::consensus::RaftConfigPB const&, kudu::consensus::RaftConfigPB const&, kudu::Callback<void ()(kudu::Status const&)> const&) ()
> #6  0x00000000009795be in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #7  0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> The rest are in:
> {code}
> #1  0x0000000001924e87 in base::SpinLock::SlowLock() ()
> #2  0x000000000097df28 in kudu::consensus::ReplicaState::LockForConfigChange(std::unique_lock<kudu::simple_spinlock>*) const ()
> #3  0x00000000009791dd in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #4  0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> It appears that the thread holding the lock is waiting on a peer response (in
> order to close the peer), but the peer response is stuck in the ThreadPool's
> queue. It will never be processed, since every thread in the pool is
> occupied: one waits for the response and the rest wait for the lock that
> thread holds.
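
The starvation pattern described above can be reproduced in isolation. What follows is a minimal sketch, not Kudu code: TinyPool and CloserGotResponse are hypothetical names invented for illustration, a std::mutex stands in for the config-change spinlock, and a std::future with a timeout stands in for the semaphore that Peer::Close() waits on (the timeout exists only so that the demonstration terminates instead of hanging).

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical fixed-size FIFO pool; a stand-in, not kudu::ThreadPool.
class TinyPool {
 public:
  explicit TinyPool(std::size_t num_threads) {
    assert(num_threads > 0);
    for (std::size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }
  // Drains the remaining queue, then joins all workers.
  ~TinyPool() {
    {
      std::lock_guard<std::mutex> l(mu_);
      shutdown_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }
  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> l(mu_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return shutdown_ || !queue_.empty(); });
        if (queue_.empty()) return;  // Shut down and nothing left to run.
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool shutdown_ = false;
  std::vector<std::thread> workers_;
};

// Returns true if the lock-holding "closer" task saw the response in time.
bool CloserGotResponse(std::size_t pool_threads, std::size_t lock_waiters,
                       std::chrono::milliseconds timeout) {
  std::mutex config_lock;
  std::promise<void> lock_held, response;
  std::promise<bool> result;
  std::future<void> lock_held_f = lock_held.get_future();
  std::future<void> response_f = response.get_future();
  std::future<bool> result_f = result.get_future();
  TinyPool pool(pool_threads);  // Declared last, so it is joined first.

  // The closer: takes the lock, then waits for a response while holding it.
  pool.Submit([&] {
    std::lock_guard<std::mutex> l(config_lock);
    lock_held.set_value();
    result.set_value(response_f.wait_for(timeout) ==
                     std::future_status::ready);
  });
  lock_held_f.wait();
  // The other tasks: each occupies a pool thread while blocked on the lock.
  for (std::size_t i = 0; i < lock_waiters; ++i) {
    pool.Submit([&] { std::lock_guard<std::mutex> l(config_lock); });
  }
  // The response is queued on the same pool, behind the lock waiters.
  pool.Submit([&] { response.set_value(); });
  return result_f.get();
}
```

With 4 pool threads and 3 lock waiters, every thread is occupied before the response task can be dequeued, so the closer times out. With a fifth thread the response runs immediately. This suggests why the hang depends on the pool's maximum size and on how many tasks pile up on the lock, though the sketch makes no claim about how KUDU-699 resolves it.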



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
