[
https://issues.apache.org/jira/browse/KUDU-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180062#comment-15180062
]
Todd Lipcon commented on KUDU-1365:
-----------------------------------
What's not clear to me is why this code in RequestVote isn't sufficient?
{code}
// If we've heard recently from the leader, then we should ignore the request.
// It might be from a "disruptive" server. This could happen in a few cases:
//
// 1) Network partitions
// If the leader can talk to a majority of the nodes, but is partitioned from a
// bad node, the bad node's failure detector will trigger. If the bad node is
// able to reach other nodes in the cluster, it will continuously trigger elections.
//
// 2) An abandoned node
// It's possible that a node has fallen behind the log GC mark of the leader. In that
// case, the leader will stop sending it requests. Eventually, the configuration
// will change to eject the abandoned node, but until that point, we don't want the
// abandoned follower to disturb the other nodes.
//
// See also https://ramcloud.stanford.edu/~ongaro/thesis.pdf section 4.2.3.
MonoTime now = MonoTime::Now(MonoTime::COARSE);
if (!request->ignore_live_leader() &&
    now.ComesBefore(withhold_votes_until_)) {
  return RequestVoteRespondLeaderIsAlive(request, response);
}
{code}
If this bad node is hitting timeouts and requesting votes, it seems like that
shouldn't cause the other nodes' terms to advance.
Perhaps the issue is that the bad node is advancing its _own_ term, and then
eventually leader requests do get through and the INVALID_TERM response causes
the leader to step down?
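To make that scenario concrete, here's a minimal sketch (my own illustration, not
Kudu's actual code) of the standard Raft rule at play: when the leader finally
reaches the bad node and the response carries a higher term, the leader adopts
that term and steps down, which would be enough to cause the flapping.
{code}
#include <cstdint>

// Hypothetical sketch of the step-down rule described above; the class and
// member names are illustrative, not Kudu's actual implementation.
struct ConsensusResponse {
  int64_t responder_term;  // term reported by the follower
  bool status_ok;          // false e.g. when the follower replied INVALID_TERM
};

class LeaderState {
 public:
  // Called on the leader when a follower's UpdateConsensus response arrives.
  void HandleResponse(const ConsensusResponse& resp) {
    if (resp.responder_term > current_term_) {
      // The "bad" node has advanced its own term by repeatedly calling
      // elections. Per Raft, the leader must adopt the higher term and
      // step down, even though it still has a healthy majority.
      current_term_ = resp.responder_term;
      is_leader_ = false;
    }
  }

 private:
  int64_t current_term_ = 0;
  bool is_leader_ = true;
};
{code}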
Some kind of timeout on the LockForUpdate() call in UpdateReplica() could also
help avoid these RPCs "stacking up" if the disk's too slow to keep up with the
request rate.
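Roughly what I have in mind, assuming a plain std::timed_mutex as a stand-in for
the replica's real update lock (the names here are illustrative, not Kudu's API):
{code}
#include <chrono>
#include <mutex>

// Hypothetical sketch of bounding how long an incoming UpdateConsensus RPC may
// wait for the replica's update lock, so slow-disk stalls shed load instead of
// letting RPCs stack up in the service queue.
class ReplicaUpdateHandler {
 public:
  bool UpdateReplica() {
    std::unique_lock<std::timed_mutex> lock(update_lock_, std::defer_lock);
    // If the current holder is stuck in a slow disk write, give up quickly and
    // let the caller respond with backpressure rather than queueing behind it.
    if (!lock.try_lock_for(std::chrono::milliseconds(500))) {
      return false;  // reject; the leader retries on the next heartbeat period
    }
    ApplyUpdate();   // apply the request while holding the lock
    return true;
  }

 private:
  void ApplyUpdate() {}
  std::timed_mutex update_lock_;
};
{code}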
> Leader flapping when one machine has a very slow disk
> -----------------------------------------------------
>
> Key: KUDU-1365
> URL: https://issues.apache.org/jira/browse/KUDU-1365
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.7.0
> Reporter: Mike Percy
> Assignee: Mike Percy
>
> There is an issue that [~decster] ran into in production where a machine had
> a bad disk. Unfortunately, the disk was not failing to write, it was just
> very slow. This resulted in the UpdateConsensus RPC queue filling up on that
> machine when it was a follower, which looked like this in the log (from the
> leader's perspective):
> {code}
> W0229 00:07:14.332468 18148 consensus_peers.cc:316] T
> 41e637c3c2b34e8db36d138c4d37d032 P 6434970484be4d29855f05e1f6aed1b8 -> Peer
> 7f331507718d477f96d60eb1bc573baa (st128:18700): Couldn't send request to peer
> 7f331507718d477f96d60eb1bc573baa for tablet 41e637c3c2b34e8db36d138c4d37d032
> Status: Remote error: Service unavailable: UpdateConsensus request on
> kudu.consensus.ConsensusService dropped due to backpressure. The service
> queue is full; it has 50 items.. Retrying in the next heartbeat period.
> Already tried 25 times.
> {code}
> The result is that the follower could not receive heartbeat messages from the
> leader anymore. The follower (st128) would decide that the leader was dead
> and start an election. Because it had the same amount of data as the rest of
> the cluster, it won the election. Then, for reasons we still need to
> investigate, it would not heartbeat to its own followers. After some timeout,
> a different node (typically the previous leader) would start an election, get
> elected, and the flapping process would continue.
> It's possible that the bad node, when leader, was only partially transitioned
> to leadership, and was blocking on some disk operation before starting to
> heartbeat. Hopefully we can get logs from the bad node so we can better
> understand what was happening from its perspective.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)