[ https://issues.apache.org/jira/browse/KUDU-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423922#comment-15423922 ]

Todd Lipcon commented on KUDU-1564:
-----------------------------------

It seems that the many TryRemoveFollower tasks were submitted in quick
succession because one of the followers fell behind log retention while a very
high ingest rate was being sustained. Each incoming write RPC then submitted
another copy of the TryRemoveFollower task to the raft pool:

{code}
I0816 18:46:22.246892 42382 consensus_queue.cc:578] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 [LEADER]: Connected to new peer: Peer: f283a0b008d8473f947bf160f5f1da6d, Is new: false, Last received: 89.88204, Next index
I0816 18:46:22.247006 42382 consensus_peers.cc:186] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f28
I0816 18:46:22.247190 42382 consensus_peers.cc:186] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f28
I0816 18:46:22.247879 42382 consensus_peers.cc:186] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f28
I0816 18:46:22.248777 42382 consensus_peers.cc:186] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f28
I0816 18:46:22.250759 42382 raft_consensus.cc:681] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config. Reason: The logs necessary to catch up
I0816 18:46:22.251900 42575 raft_consensus.cc:681] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config. Reason: The logs necessary to catch up
I0816 18:46:22.251972 42576 raft_consensus.cc:681] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config. Reason: The logs necessary to catch up
I0816 18:46:22.252182 42577 raft_consensus.cc:681] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config. Reason: The logs necessary to catch up
I0816 18:46:22.254709 42578 consensus_peers.cc:186] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f28
I0816 18:46:22.256461 42579 consensus_peers.cc:186] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f28
I0816 18:46:22.256589 42578 raft_consensus.cc:681] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config. Reason: The logs necessary to catch up
I0816 18:46:22.256608 42579 raft_consensus.cc:681] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config. Reason: The logs necessary to catch up
I0816 18:46:22.256824 42580 consensus_peers.cc:186] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f28
I0816 18:46:22.256983 42580 raft_consensus.cc:681] T 8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config. Reason: The logs necessary to catch up
{code}


> Deadlock on raft notification ThreadPool
> ----------------------------------------
>
>                 Key: KUDU-1564
>                 URL: https://issues.apache.org/jira/browse/KUDU-1564
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.10.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>         Attachments: stacks.txt
>
>
> In a stress test on a cluster, one of the tablet servers got stuck in a 
> deadlock. It appears that:
> - the Raft notification threadpool for a tablet has 24 max threads 
> (corresponding to the number of cores)
> - One of the threads is in:
> {code}
> #1  0x00000000019019b2 in kudu::Semaphore::Acquire() ()
> #2  0x0000000000985159 in kudu::consensus::Peer::Close() ()
> #3  0x000000000099d909 in kudu::consensus::PeerManager::Close() ()
> #4  0x00000000009684bd in kudu::consensus::RaftConsensus::RefreshConsensusQueueAndPeersUnlocked() ()
> #5  0x000000000096eced in kudu::consensus::RaftConsensus::ReplicateConfigChangeUnlocked(kudu::consensus::RaftConfigPB const&, kudu::consensus::RaftConfigPB const&, kudu::Callback<void ()(kudu::Status const&)> const&) ()
> #6  0x00000000009795be in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #7  0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> - The rest are in:
> {code}
> #1  0x0000000001924e87 in base::SpinLock::SlowLock() ()
> #2  0x000000000097df28 in kudu::consensus::ReplicaState::LockForConfigChange(std::unique_lock<kudu::simple_spinlock>*) const ()
> #3  0x00000000009791dd in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #4  0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> It appears that the thread holding the lock is waiting on a peer response (in
> order to close the peer), but the task that would deliver that response is
> stuck in the ThreadPool's queue and will never run: every thread in the pool
> is occupied by a task that is, directly or indirectly, waiting on it.


