Adar Dembo created KUDU-1613:
--------------------------------
Summary: Under certain circumstances, tablet leader does not evict
failed replica
Key: KUDU-1613
URL: https://issues.apache.org/jira/browse/KUDU-1613
Project: Kudu
Issue Type: Bug
Components: consensus, tablet
Affects Versions: 1.0.0
Reporter: Adar Dembo
Priority: Critical
Dan found this while working on Kudu training material.
Suppose you have a three-node cluster and a table with a singleton tablet
(replicated three times). Now suppose you stop one tserver, delete all of
its on-disk data, then restart it.
You would expect the following:
# The tablet's leader replica can no longer reach the replica on the
reformatted tserver.
# The leader will evict that replica.
# The master will notice the tablet's under-replication and ask the leader to
add a new replica, probably on the reformatted node.
Instead, there's no eviction at all. The leader replica keeps spewing messages
like this in its log:
{noformat}
W0913 14:13:18.411238 22597 consensus_peers.cc:332] T
89dfba0c0a714259acf69d9f611e1e92 P 1540ac6e6cb44c2c9f9c6c6c98fd61f7 -> Peer
cc2ef23f1c2c42b7a6a02d7183d92884 (dan-test-g-2.gce.cloudera.com:7050): Couldn't
send request to peer cc2ef23f1c2c42b7a6a02d7183d92884 for tablet
89dfba0c0a714259acf69d9f611e1e92. Error code: WRONG_SERVER_UUID (16). Status:
Invalid argument: UpdateConsensus: Wrong destination UUID requested. Local
UUID: ef3ea81d59fc4a91b754cfe63b21e6ee. Requested UUID:
cc2ef23f1c2c42b7a6a02d7183d92884. Retrying in the next heartbeat period.
Already tried 5821 times.
{noformat}
Having looked at the code responsible for starting replica eviction
(PeerMessageQueue::RequestForPeer) and the code spewing that error
(Peer::ProcessResponseError), I think I see what's going on. The eviction code
in RequestForPeer() checks the peer's "last successful communication time" to
decide whether to evict or not. Intuitively you'd expect that time to be
updated only when the peer responds successfully, but there are a couple of cases
in Peer::ProcessResponseError where we update the last communication time
anyway. Notably:
# If the RPC controller yielded a RemoteError, or
# If the RPC controller had no error but the response itself contained an
error, and the error's code was not TABLET_NOT_FOUND, or
# If the RPC controller and the response had no error, but the response's
status had an error, and that error's code was CANNOT_PREPARE.
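To make the three cases concrete, here is a rough Python model of the decision as I read it. This is not the actual Kudu code: PeerResponse, its fields, and the function name are all made up for illustration, and the assumption that other controller errors (e.g. a timeout) do not refresh the time is my inference from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerResponse:
    """Hypothetical stand-in for one UpdateConsensus reply as seen by the leader."""
    controller_error: Optional[str] = None  # e.g. "REMOTE_ERROR", "TIMED_OUT"
    response_error: Optional[str] = None    # e.g. "TABLET_NOT_FOUND", "WRONG_SERVER_UUID"
    status_error: Optional[str] = None      # e.g. "CANNOT_PREPARE"


def error_path_refreshes_last_communication(resp: PeerResponse) -> bool:
    """Return True if the error path would still update the last communication time."""
    # Case 1: the RPC controller itself reported an error.
    # Only a RemoteError refreshes the time (assumption for other error kinds).
    if resp.controller_error is not None:
        return resp.controller_error == "REMOTE_ERROR"
    # Case 2: the controller was fine but the response carried an error.
    # Everything except TABLET_NOT_FOUND refreshes the time.
    if resp.response_error is not None:
        return resp.response_error != "TABLET_NOT_FOUND"
    # Case 3: neither of the above, but the response's status had an error.
    # Only CANNOT_PREPARE refreshes the time.
    return resp.status_error == "CANNOT_PREPARE"
```

In this model a reply carrying WRONG_SERVER_UUID lands in case 2 and is treated as a refresh, while TABLET_NOT_FOUND is not.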
I think we're hitting case #2, because there should be no RPC controller error
(the reformatted tserver did respond to the leader replica), but the response
does contain a WRONG_SERVER_UUID error.
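If that theory is right, the effect on eviction can be shown with a toy simulation. Everything here is hypothetical: the function name, the 500 ms heartbeat interval, and the 300 s eviction timeout are placeholders rather than values taken from the Kudu source. Only the 5821 retry count comes from the log above.

```python
from typing import Optional


def heartbeats_until_eviction(
    refresh_on_error: bool,
    heartbeat_interval_ms: int = 500,    # hypothetical heartbeat period
    eviction_timeout_ms: int = 300_000,  # hypothetical eviction timeout
    max_heartbeats: int = 5821,          # retry count seen in the log
) -> Optional[int]:
    """Return the heartbeat number at which the leader would start eviction,
    or None if it never does within max_heartbeats.

    Every request in this model is answered with an error such as
    WRONG_SERVER_UUID; refresh_on_error says whether that answer still
    updates the peer's last communication time.
    """
    now_ms = 0
    last_communication_ms = 0
    for attempt in range(1, max_heartbeats + 1):
        now_ms += heartbeat_interval_ms
        # Leader-side check made before building the next request.
        if now_ms - last_communication_ms > eviction_timeout_ms:
            return attempt
        # The peer replies with an error on every heartbeat.
        if refresh_on_error:
            last_communication_ms = now_ms
    return None


if __name__ == "__main__":
    print(heartbeats_until_eviction(refresh_on_error=True))
    print(heartbeats_until_eviction(refresh_on_error=False))
```

With these made-up numbers, the refreshing variant never reaches the timeout, so it returns None even after 5821 attempts. The non-refreshing variant starts eviction at heartbeat 601.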