Adar Dembo created KUDU-1613:
--------------------------------

             Summary: Under certain circumstances, tablet leader does not evict failed replica
                 Key: KUDU-1613
                 URL: https://issues.apache.org/jira/browse/KUDU-1613
             Project: Kudu
          Issue Type: Bug
          Components: consensus, tablet
    Affects Versions: 1.0.0
            Reporter: Adar Dembo
            Priority: Critical


Dan found this while working on Kudu training material.

Suppose you have a three node cluster and a table with a singleton tablet 
(replicated three times). Now suppose you stopped one tserver, deleted all of 
its on-disk data, then restarted it.

You would expect the following:
# The tablet's leader replica can no longer reach the replica on the 
reformatted tserver.
# The leader will evict that replica.
# The master will notice the tablet's under-replication and ask the leader to 
add a new replica, probably on the reformatted node.

Instead, there's no eviction at all. The leader replica keeps spewing messages 
like this in its log:
{noformat}
W0913 14:13:18.411238 22597 consensus_peers.cc:332] T 
89dfba0c0a714259acf69d9f611e1e92 P 1540ac6e6cb44c2c9f9c6c6c98fd61f7 -> Peer 
cc2ef23f1c2c42b7a6a02d7183d92884 (dan-test-g-2.gce.cloudera.com:7050): Couldn't 
send request to peer cc2ef23f1c2c42b7a6a02d7183d92884 for tablet 
89dfba0c0a714259acf69d9f611e1e92. Error code: WRONG_SERVER_UUID (16). Status: 
Invalid argument: UpdateConsensus: Wrong destination UUID requested. Local 
UUID: ef3ea81d59fc4a91b754cfe63b21e6ee. Requested UUID: 
cc2ef23f1c2c42b7a6a02d7183d92884. Retrying in the next heartbeat period. 
Already tried 5821 times.
{noformat}

Having looked at the code responsible for starting replica eviction 
(PeerMessageQueue::RequestForPeer) and the code spewing that error 
(Peer::ProcessResponseError), I think I see what's going on. The eviction code 
in RequestForPeer() checks the peer's "last successful communication time" to 
decide whether to evict or not. Intuitively you'd expect that time to be 
updated only when the peer responds successfully, but there are a couple of cases 
in Peer::ProcessResponseError where we update the last communication time 
anyway. Notably:
# If the RPC controller yielded a RemoteError, or
# If the RPC controller had no error but the response itself contained an 
error, and the error's code was not TABLET_NOT_FOUND, or
# If the RPC controller and the response had no error, but the response's 
status had an error, and that error's code was CANNOT_PREPARE.
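
The three cases can be written down as a single predicate. This is only a
minimal sketch of the decision as described above, not Kudu's actual code:
the enums, the RpcOutcome struct, and the function name are hypothetical
stand-ins, and controller errors other than RemoteError are assumed not to
refresh the timestamp.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the three layers of an RPC outcome
// described above. These are NOT Kudu's real types.
enum class ControllerOutcome { kOk, kRemoteError, kOtherError };
enum class ResponseError { kNone, kTabletNotFound, kWrongServerUuid, kOther };
enum class StatusError { kNone, kCannotPrepare, kOther };

struct RpcOutcome {
  ControllerOutcome controller;
  ResponseError response;
  StatusError status;
};

// Models only the error path: returns true if, per the three cases listed
// above, the peer's "last successful communication time" would still be
// refreshed even though the request did not succeed.
bool ErrorPathRefreshesLastCommunication(const RpcOutcome& o) {
  // Case 1: the RPC controller yielded a RemoteError.
  if (o.controller == ControllerOutcome::kRemoteError) return true;
  // Assumption: any other controller error does not refresh the time.
  if (o.controller != ControllerOutcome::kOk) return false;
  // Case 2: a response-level error other than TABLET_NOT_FOUND.
  if (o.response != ResponseError::kNone) {
    return o.response != ResponseError::kTabletNotFound;
  }
  // Case 3: no response error, but the status carries CANNOT_PREPARE.
  return o.status == StatusError::kCannotPrepare;
}

// Usage: the scenario from the log above (controller OK, response carries
// WRONG_SERVER_UUID) lands in case 2.
inline void CheckReportedScenario() {
  RpcOutcome reported{ControllerOutcome::kOk, ResponseError::kWrongServerUuid,
                      StatusError::kNone};
  assert(ErrorPathRefreshesLastCommunication(reported));
}
```

In this model, TABLET_NOT_FOUND is the only response-level error that leaves
the timestamp untouched.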

I think we're hitting case #2, because there should be no RPC controller error 
(the reformatted tserver did respond to the leader replica), but the response 
does contain a WRONG_SERVER_UUID error.
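
Why this blocks eviction can be shown with a toy simulation. It is a sketch
under stated assumptions, not the real RequestForPeer() logic: PeerTracker,
SimulateEviction, the 1-second heartbeat, and the 300-second eviction timeout
are all hypothetical values chosen for illustration.

```cpp
#include <cassert>

// Hypothetical model of the leader's per-peer bookkeeping; not Kudu's code.
struct PeerTracker {
  int last_communication_sec = 0;  // "last successful communication time"
};

// Simulates periodic requests to a peer that always answers with an error.
// Returns the time (in seconds) at which the leader would decide to evict,
// or -1 if it never does within total_sec.
// refresh_on_error models whether an error response still updates the last
// communication time, as in the WRONG_SERVER_UUID case.
int SimulateEviction(bool refresh_on_error, int heartbeat_interval_sec,
                     int eviction_timeout_sec, int total_sec) {
  PeerTracker peer;
  for (int now = heartbeat_interval_sec; now <= total_sec;
       now += heartbeat_interval_sec) {
    // Eviction check made while preparing the next request for the peer.
    if (now - peer.last_communication_sec > eviction_timeout_sec) {
      return now;
    }
    // The reformatted tserver responds, but only with an error.
    if (refresh_on_error) {
      peer.last_communication_sec = now;
    }
  }
  return -1;
}

// Usage: with the refresh, the timeout never elapses; without it, eviction
// would start on the first heartbeat after the timeout.
inline void CheckSimulation() {
  assert(SimulateEviction(true, 1, 300, 6000) == -1);
  assert(SimulateEviction(false, 1, 300, 6000) == 301);
}
```

As long as each error response resets the timestamp, the elapsed time seen by
the eviction check never exceeds one heartbeat interval, which is consistent
with the leader retrying thousands of times without evicting.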



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)