[
https://issues.apache.org/jira/browse/KUDU-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423922#comment-15423922
]
Todd Lipcon commented on KUDU-1564:
-----------------------------------
It seems that the many TryRemoveFollower tasks were submitted in quick
succession because one of the followers fell behind log retention while a very
high ingest rate was being sustained. Each incoming write RPC resulted in
another copy of the TryRemoveFollower task being submitted to the Raft pool:
{code}
I0816 18:46:22.246892 42382 consensus_queue.cc:578] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 [LEADER]:
Connected to new peer: Peer: f283a0b008d8473f947bf160f5f1da6d, Is new: false,
Last received: 89.88204, Next index
I0816 18:46:22.247006 42382 consensus_peers.cc:186] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer
f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not
obtain request from queue for peer: f28
I0816 18:46:22.247190 42382 consensus_peers.cc:186] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer
f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not
obtain request from queue for peer: f28
I0816 18:46:22.247879 42382 consensus_peers.cc:186] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer
f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not
obtain request from queue for peer: f28
I0816 18:46:22.248777 42382 consensus_peers.cc:186] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer
f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not
obtain request from queue for peer: f28
I0816 18:46:22.250759 42382 raft_consensus.cc:681] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting
to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config.
Reason: The logs necessary to catch up
I0816 18:46:22.251900 42575 raft_consensus.cc:681] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting
to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config.
Reason: The logs necessary to catch up
I0816 18:46:22.251972 42576 raft_consensus.cc:681] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting
to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config.
Reason: The logs necessary to catch up
I0816 18:46:22.252182 42577 raft_consensus.cc:681] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting
to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config.
Reason: The logs necessary to catch up
I0816 18:46:22.254709 42578 consensus_peers.cc:186] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer
f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not
obtain request from queue for peer: f28
I0816 18:46:22.256461 42579 consensus_peers.cc:186] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer
f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not
obtain request from queue for peer: f28
I0816 18:46:22.256589 42578 raft_consensus.cc:681] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting
to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config.
Reason: The logs necessary to catch up
I0816 18:46:22.256608 42579 raft_consensus.cc:681] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting
to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config.
Reason: The logs necessary to catch up
I0816 18:46:22.256824 42580 consensus_peers.cc:186] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66 -> Peer
f283a0b008d8473f947bf160f5f1da6d (e1205.halxg.cloudera.com:7050): Could not
obtain request from queue for peer: f28
I0816 18:46:22.256983 42580 raft_consensus.cc:681] T
8279136da07443c0befa1cfedcaf2f17 P a95a168ec7414143a47604df73eb1f66: Attempting
to remove follower f283a0b008d8473f947bf160f5f1da6d from the Raft config.
Reason: The logs necessary to catch up
{code}
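To make the pile-up concrete, here is a minimal single-threaded sketch of the
submission pattern. It is not Kudu code: FakePool, SimulateBurst, and the
"pending" flag are hypothetical names invented for illustration, and the
coalescing branch is only one possible way to avoid the duplicates, not
necessarily how this should be fixed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for a pool's task queue (not Kudu's ThreadPool).
// Tasks are only queued here; nothing runs until RunAll() is called.
class FakePool {
 public:
  void Submit(std::function<void()> task) { queue_.push(std::move(task)); }
  std::size_t queued() const { return queue_.size(); }
  void RunAll() {
    while (!queue_.empty()) {
      std::function<void()> task = std::move(queue_.front());
      queue_.pop();
      task();
    }
  }

 private:
  std::queue<std::function<void()>> queue_;
};

// Simulates `num_rpcs` write RPCs that each notice the lagging follower
// before the pool has run anything. Returns how many tasks were queued.
std::size_t SimulateBurst(int num_rpcs, bool coalesce) {
  assert(num_rpcs >= 0);
  FakePool pool;
  std::atomic<bool> pending{false};  // hypothetical "removal already queued" flag
  for (int i = 0; i < num_rpcs; ++i) {
    if (coalesce && pending.exchange(true)) {
      continue;  // a removal task is already queued; skip the duplicate
    }
    // The task clears the flag when it runs, so a later burst can submit again.
    pool.Submit([&pending] { pending.store(false); });
  }
  const std::size_t queued = pool.queued();
  pool.RunAll();  // drain the queue before `pending` goes out of scope
  return queued;
}
```

In this model SimulateBurst(1000, false) queues one task per RPC, while
SimulateBurst(1000, true) queues a single task for the whole burst.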
> Deadlock on raft notification ThreadPool
> ----------------------------------------
>
> Key: KUDU-1564
> URL: https://issues.apache.org/jira/browse/KUDU-1564
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.10.0
> Reporter: Todd Lipcon
> Priority: Critical
> Attachments: stacks.txt
>
>
> In a stress test on a cluster, one of the tablet servers got stuck in a
> deadlock. It appears that:
> - The Raft notification threadpool for a tablet has a maximum of 24 threads
> (corresponding to the number of cores)
> - One of the threads is in:
> {code}
> #1 0x00000000019019b2 in kudu::Semaphore::Acquire() ()
> #2 0x0000000000985159 in kudu::consensus::Peer::Close() ()
> #3 0x000000000099d909 in kudu::consensus::PeerManager::Close() ()
> #4 0x00000000009684bd in
> kudu::consensus::RaftConsensus::RefreshConsensusQueueAndPeersUnlocked() ()
> #5 0x000000000096eced in
> kudu::consensus::RaftConsensus::ReplicateConfigChangeUnlocked(kudu::consensus::RaftConfigPB
> const&, kudu::consensus::RaftConfigPB const&, kudu::Callback<void
> ()(kudu::Status const&)> const&) ()
> #6 0x00000000009795be in
> kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB
> const&, kudu::Callback<void ()(kudu::Status const&)> const&,
> boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #7 0x0000000000978cd0 in
> kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> kudu::consensus::RaftConfigPB const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> - The rest are in:
> {code}
> #1 0x0000000001924e87 in base::SpinLock::SlowLock() ()
> #2 0x000000000097df28 in
> kudu::consensus::ReplicaState::LockForConfigChange(std::unique_lock<kudu::simple_spinlock>*)
> const ()
> #3 0x00000000009791dd in
> kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB
> const&, kudu::Callback<void ()(kudu::Status const&)> const&,
> boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #4 0x0000000000978cd0 in
> kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> kudu::consensus::RaftConfigPB const&, std::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> It appears that the thread holding the lock is waiting on a peer response (in
> order to close the peer), but the peer response is waiting in the
> ThreadPool's queue and will never be processed, since every thread in the
> pool is blocked, either on the lock or on the thread that holds it.