[ https://issues.apache.org/jira/browse/KUDU-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned KUDU-1365:
---------------------------------

    Assignee: Todd Lipcon  (was: Mike Percy)

> Leader flapping when one machine has a very slow disk
> -----------------------------------------------------
>
>                 Key: KUDU-1365
>                 URL: https://issues.apache.org/jira/browse/KUDU-1365
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.7.0
>            Reporter: Mike Percy
>            Assignee: Todd Lipcon
>
> [~decster] ran into an issue in production where a machine had a bad disk. 
> Unfortunately, the disk was not failing writes outright; it was just very 
> slow. This caused the UpdateConsensus RPC service queue on that machine to 
> fill up while it was a follower, which looked like this in the log (from the 
> leader's perspective):
> {code}
> W0229 00:07:14.332468 18148 consensus_peers.cc:316] T 
> 41e637c3c2b34e8db36d138c4d37d032 P 6434970484be4d29855f05e1f6aed1b8 -> Peer 
> 7f331507718d477f96d60eb1bc573baa (st128:18700): Couldn't send request to peer 
> 7f331507718d477f96d60eb1bc573baa for tablet 41e637c3c2b34e8db36d138c4d37d032 
> Status: Remote error: Service unavailable: UpdateConsensus request on 
> kudu.consensus.ConsensusService dropped due to backpressure. The service 
> queue is full; it has 50 items.. Retrying in the next heartbeat period. 
> Already tried 25 times.
> {code}
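>
> As a rough illustration of that backpressure behavior (a minimal C++ sketch, 
> not Kudu's actual RPC implementation; the 50-item limit is taken from the log 
> above, everything else is made up), a bounded service queue simply rejects 
> new calls once the handlers draining it have stalled and the limit is hit:
> {code}
> // Minimal sketch: a bounded RPC service queue that rejects new requests
> // once it is full. A slow disk stalls the handlers that drain the queue,
> // so incoming UpdateConsensus calls start getting dropped here.
> #include <cstddef>
> #include <cstdio>
> #include <deque>
> #include <string>
>
> struct InboundCall {
>   std::string method;
> };
>
> class ServiceQueue {
>  public:
>   explicit ServiceQueue(std::size_t max_size) : max_size_(max_size) {}
>
>   // Returns false ("Service unavailable") when the queue is already full;
>   // the caller then replies to the client with a backpressure error.
>   bool Put(InboundCall call) {
>     if (queue_.size() >= max_size_) {
>       return false;
>     }
>     queue_.push_back(std::move(call));
>     return true;
>   }
>
>  private:
>   std::size_t max_size_;
>   std::deque<InboundCall> queue_;
> };
>
> int main() {
>   ServiceQueue q(50);  // the "it has 50 items" limit from the log
>   for (int i = 0; i < 55; ++i) {
>     if (!q.Put({"UpdateConsensus"})) {
>       std::printf("request %d dropped due to backpressure\n", i);
>     }
>   }
>   return 0;
> }
> {code}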
> The result was that the follower could no longer receive heartbeat messages 
> from the leader. The follower (st128) would decide that the leader was dead 
> and start an election. Because it had the same amount of data as the rest of 
> the cluster, it would win the election. Then, for reasons we still need to 
> investigate, it would not send heartbeats to its own followers. After some 
> timeout, a different node (typically the previous leader) would start an 
> election, get elected, and the flapping cycle would continue.
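>
> The follower-side half of this loop is just the usual Raft-style leader 
> failure detector. A minimal sketch of that timer (assumed logic, not Kudu's 
> code) shows why rejected heartbeats look the same as a dead leader:
> {code}
> // Minimal sketch of a leader failure detector: if no heartbeat is
> // processed within the election timeout, the node assumes the leader is
> // dead and starts an election. When the leader's UpdateConsensus calls
> // are being rejected by this node's own full service queue, the timer
> // fires even though the leader is perfectly healthy.
> #include <chrono>
> #include <cstdio>
> #include <thread>
>
> using Clock = std::chrono::steady_clock;
>
> class FailureDetector {
>  public:
>   explicit FailureDetector(std::chrono::milliseconds timeout)
>       : timeout_(timeout), last_heartbeat_(Clock::now()) {}
>
>   // Called when a heartbeat from the leader is actually processed.
>   void Snooze() { last_heartbeat_ = Clock::now(); }
>
>   // Polled periodically; true means "start an election".
>   bool LeaderConsideredDead() const {
>     return Clock::now() - last_heartbeat_ > timeout_;
>   }
>
>  private:
>   std::chrono::milliseconds timeout_;
>   Clock::time_point last_heartbeat_;
> };
>
> int main() {
>   FailureDetector fd(std::chrono::milliseconds(100));
>   // Heartbeats stuck behind the full RPC queue never reach Snooze(), so
>   // after the timeout this node calls an election it is likely to win,
>   // since its log is as up to date as everyone else's.
>   std::this_thread::sleep_for(std::chrono::milliseconds(150));
>   std::printf("start election? %s\n",
>               fd.LeaderConsideredDead() ? "yes" : "no");
>   return 0;
> }
> {code}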
> It's possible that the bad node, once elected leader, had only partially 
> transitioned to leadership and was blocked on some disk operation before it 
> could start heartbeating. Hopefully we can get logs from the bad node so we 
> can better understand what was happening from its perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
