[
https://issues.apache.org/jira/browse/KUDU-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon reassigned KUDU-1365:
---------------------------------
Assignee: Todd Lipcon (was: Mike Percy)
> Leader flapping when one machine has a very slow disk
> -----------------------------------------------------
>
> Key: KUDU-1365
> URL: https://issues.apache.org/jira/browse/KUDU-1365
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.7.0
> Reporter: Mike Percy
> Assignee: Todd Lipcon
>
> [~decster] ran into an issue in production where a machine had a bad
> disk. Unfortunately, the disk was not failing writes outright; it was just
> very slow. This caused the UpdateConsensus RPC queue to fill up on that
> machine when it was a follower, which looked like this in the log (from the
> leader's perspective):
> {code}
> W0229 00:07:14.332468 18148 consensus_peers.cc:316] T
> 41e637c3c2b34e8db36d138c4d37d032 P 6434970484be4d29855f05e1f6aed1b8 -> Peer
> 7f331507718d477f96d60eb1bc573baa (st128:18700): Couldn't send request to peer
> 7f331507718d477f96d60eb1bc573baa for tablet 41e637c3c2b34e8db36d138c4d37d032
> Status: Remote error: Service unavailable: UpdateConsensus request on
> kudu.consensus.ConsensusService dropped due to backpressure. The service
> queue is full; it has 50 items.. Retrying in the next heartbeat period.
> Already tried 25 times.
> {code}
> The result was that the follower could no longer receive heartbeat messages
> from the leader. The follower (st128) would then decide that the leader was
> dead and start an election. Because it had the same amount of data as the
> rest of the cluster, it won the election. Then, for reasons we still need to
> investigate, it would not heartbeat to its own followers. After some timeout,
> a different node (typically the previous leader) would start an election, get
> elected, and the flapping cycle would repeat.
> It's possible that the bad node, when leader, was only partially transitioned
> to leadership, and was blocking on some disk operation before starting to
> heartbeat. Hopefully we can get logs from the bad node so we can better
> understand what was happening from its perspective.
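To make the cycle concrete, here is a toy simulation of the failure mode described above. This is not Kudu code; the constants (queue capacity, timeouts, drain rate) are made up for illustration, and the two-node model ignores real Raft details like terms and log matching. It only shows how a bounded RPC queue draining slower than heartbeats arrive can produce indefinite leader flapping:

```python
# Toy model: node A is healthy, node B has a slow disk. B's bounded service
# queue drains slower than heartbeats arrive, so heartbeats get dropped
# ("due to backpressure", as in the log above), B times out and wins an
# election, then stalls on its disk and loses leadership again.
# All constants are illustrative, not Kudu's real settings.

ELECTION_TIMEOUT = 5   # ticks without contact before calling an election
QUEUE_CAPACITY = 3     # bounded service queue (the real one held 50 items)
DRAIN_RATE = 0.1       # slow disk: work drained from B's queue per tick

def simulate(ticks):
    """Return the observed leader ('A' or 'B') at each tick."""
    leader, leaders = "A", []
    queue = 0.0          # B's pending UpdateConsensus work
    last_ok = 0          # last tick B accepted a heartbeat from A
    flip_at = 0          # tick at which B last took over leadership
    for t in range(ticks):
        queue = max(0.0, queue - DRAIN_RATE)   # slow disk drains the queue
        if leader == "A":
            # A heartbeats B every tick; the request is dropped if B's
            # queue is already full.
            if queue < QUEUE_CAPACITY:
                queue += 1                     # heartbeat accepted
                last_ok = t
            if t - last_ok >= ELECTION_TIMEOUT:
                leader, flip_at = "B", t       # B is up to date, so it wins
        else:
            # B is blocked on its slow disk and never heartbeats its own
            # followers, so the previous leader times out and wins back.
            if t - flip_at >= ELECTION_TIMEOUT:
                leader, last_ok = "A", t
        leaders.append(leader)
    return leaders
```

Running `simulate(60)` shows leadership bouncing between A and B every few ticks rather than settling, which matches the flapping observed in production: the slow node never actually fails, so it keeps qualifying as a voter and a candidate.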
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)