[
https://issues.apache.org/jira/browse/KUDU-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556307#comment-15556307
]
Todd Lipcon commented on KUDU-1564:
-----------------------------------
Just hit another case of this; we should probably prioritize it.
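The cycle in the description below boils down to a classic thread-pool self-deadlock: a task running on the pool blocks waiting for a result that can only be delivered by another task still sitting in the same pool's queue. Here is a hypothetical minimal sketch (not Kudu's actual ThreadPool; all names are invented, the pool is shrunk to one thread, and a short timeout stands in for the real permanent hang):

{code}
// Hypothetical sketch of the deadlock shape, NOT Kudu code.
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

// Toy pool with a single worker thread (stands in for the 24-thread
// notification pool once all 24 threads are blocked).
class SingleThreadPool {
 public:
  SingleThreadPool() : worker_([this] { Run(); }) {}
  ~SingleThreadPool() {
    {
      std::lock_guard<std::mutex> l(mu_);
      done_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }
  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> l(mu_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_all();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty()) return;  // done_ was set and queue drained
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();  // runs on the pool's only thread
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool done_ = false;
  std::thread worker_;
};

int main() {
  SingleThreadPool pool;
  std::promise<void> response;               // the "peer response"
  std::future<void> response_future = response.get_future();
  std::promise<std::future_status> observed;

  // Task 1 plays the Peer::Close() path: it occupies the worker thread
  // while waiting for the response. The real code would block forever;
  // a 200ms timeout lets us observe the stall instead.
  pool.Submit([&] {
    observed.set_value(
        response_future.wait_for(std::chrono::milliseconds(200)));
  });
  // Task 2 plays the queued response callback: it can only run on the
  // same worker thread, which task 1 is occupying.
  pool.Submit([&] { response.set_value(); });

  std::future_status status = observed.get_future().get();
  // The wait timed out: the response never arrived while the waiter
  // held the pool's only thread.
  assert(status == std::future_status::timeout);
  return 0;
}
{code}

With 24 threads the shape is the same, just wider: every worker is inside a blocking wait, so the response callback at the head of the queue can never be dispatched.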
> Deadlock on raft notification ThreadPool
> ----------------------------------------
>
> Key: KUDU-1564
> URL: https://issues.apache.org/jira/browse/KUDU-1564
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.10.0
> Reporter: Todd Lipcon
> Priority: Critical
> Attachments: stacks.txt
>
>
> In a stress test on a cluster, one of the tablet servers got stuck in a
> deadlock. It appears that:
> - the Raft notification threadpool for a tablet has 24 max threads
> (corresponding to the number of cores)
> - One of the threads is in:
> {code}
> #1 0x00000000019019b2 in kudu::Semaphore::Acquire() ()
> #2 0x0000000000985159 in kudu::consensus::Peer::Close() ()
> #3 0x000000000099d909 in kudu::consensus::PeerManager::Close() ()
> #4 0x00000000009684bd in
> kudu::consensus::RaftConsensus::RefreshConsensusQueueAndPeersUnlocked() ()
> #5 0x000000000096eced in
> kudu::consensus::RaftConsensus::ReplicateConfigChangeUnlocked(kudu::consensus::RaftConfigPB
> const&, kudu::consensus::RaftConfigPB const&, kudu::Callback<void
> ()(kudu::Status const&)> const&) ()
> #6 0x00000000009795be in
> kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB
> const&, kudu::Callback<void ()(kudu::Status const&)> const&,
> boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #7 0x0000000000978cd0 in
> kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> kudu::consensus::RaftConfigPB const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> the rest are in:
> {code}
> #1 0x0000000001924e87 in base::SpinLock::SlowLock() ()
> #2 0x000000000097df28 in
> kudu::consensus::ReplicaState::LockForConfigChange(std::unique_lock<kudu::simple_spinlock>*)
> const ()
> #3 0x00000000009791dd in
> kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB
> const&, kudu::Callback<void ()(kudu::Status const&)> const&,
> boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #4 0x0000000000978cd0 in
> kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> kudu::consensus::RaftConfigPB const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> It appears that the thread holding the lock is waiting on a peer response (in
> order to close the peer), but the response callback is stuck in the
> ThreadPool's queue and will never run, because every pool thread is occupied
> by a task that is (directly or transitively) waiting for it.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)