[ https://issues.apache.org/jira/browse/KUDU-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713618#comment-15713618 ]

Todd Lipcon commented on KUDU-1779:
-----------------------------------

Here are some notes on the state of the tablet:

- on all three replicas, the transaction tracker is at the limit
-- the tracker has 5-10 transactions that have been assigned OpIds and many 
other LEADER transactions that have no OpId assigned yet (i.e. they are still 
in the prepare queue)
--- these were submitted to the prepare queue at some earlier point when the 
replica was LEADER; in many cases the replica is no longer LEADER, so we 
already know the operations will eventually fail once they make it out of the 
prepare queue
-- the ops themselves have some row key conflicts: one op has been prepared and 
is waiting to be replicated, and another op is blocked waiting to obtain the 
same row key lock. This has "stopped" the prepare thread from making any 
progress (a sketch of this situation follows below)
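
To make the prepare-thread stall concrete, here is a minimal sketch 
(hypothetical names, not Kudu's actual code) of a single-threaded prepare loop 
blocking on a row lock that is held by an already-prepared op which is still 
waiting on replication:

{code}
// Minimal sketch, not Kudu's actual code: a single prepare thread that
// acquires per-row locks in queue order. If the op at the head needs a lock
// held by an already-prepared op that is still waiting to be replicated,
// the whole queue stops.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <set>
#include <string>

struct Op {
  std::string row_key;
};

class RowLocks {
 public:
  // Blocks until 'key' is free -- this is where the prepare thread stalls.
  void Acquire(const std::string& key) {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [&] { return held_.count(key) == 0; });
    held_.insert(key);
  }
  // Called only after the holding op is replicated and applied (or aborted).
  void Release(const std::string& key) {
    std::lock_guard<std::mutex> l(mu_);
    held_.erase(key);
    cv_.notify_all();
  }
 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::set<std::string> held_;
};

// The single prepare thread: one blocked Acquire() call prevents every op
// queued behind it from ever being prepared.
void PrepareLoop(std::queue<Op>* queue, RowLocks* locks) {
  while (!queue->empty()) {
    Op op = queue->front();
    queue->pop();
    locks->Acquire(op.row_key);  // may block indefinitely
    // ... prepare 'op' and hand it off to replication; the lock is released
    // only once replication/apply completes elsewhere.
  }
}
{code}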

- the leader's queue state looks like this:

{code}
Peer: 2d52a82cb62c4064842cf628d01083fe
 Is new: false
 Last received: 151.174779
 Next index: 174780
 Last known committed idx: 174762
 Last exchange result: ERROR
 Needs tablet copy: false

Peer: 1d2e1b44012e419392279a8e674638cb
 Is new: false
 Last received: 145.174776
 Next index: 174777
 Last known committed idx: 174762
 Last exchange result: ERROR
 Needs tablet copy: false

(LEADER) Peer: c101883d3e82496989a5f9f667c30e38
 Is new: false
 Last received: 153.174786
 Next index: 174787
 Last known committed idx: 174762
 Last exchange result: SUCCESS
 Needs tablet copy: false

Consensus queue metrics:
Only Majority Done Ops: 174762
In Progress Ops: 24
Cache: LogCacheStats(num_ops=1115, bytes=54426033)

{code}
... its two followers are in 'ERROR' state because every attempt to replicate 
to them is getting CANNOT_PREPARE errors.

I captured an RPC exchange on a replica using /tracing.html:

{code}
request:
  preceding: 151.174779
  committed_index: 174762
  ops: 151.174780 - 153.174786

response:
  code: CANNOT_PREPARE:
  last_received: 151.174779
  last_received_current_leader: 0.0
  last_committed: 174762
{code}

the follower is logging:
{code}
I1201 17:06:01.941049 111261 raft_consensus.cc:866] T 
9f7e54d8b37645c084a46fc6b27ddd34 P 2d52a82cb62c4064842cf628d01083fe [term 161 
FOLLOWER]: Deduplicated request from leader. Original: 
144.174762->[144.174763-161.174790]   Dedup: 151.174779->[151.174780-161.174790]
W1201 17:06:01.941187 111261 raft_consensus.cc:1218] T 
9f7e54d8b37645c084a46fc6b27ddd34 P 2d52a82cb62c4064842cf628d01083fe [term 161 
FOLLOWER]: Could not prepare transaction for op: 151.174780. Suppressed 10 
other warnings. Status for this op: Service unavailable: Transaction failed, 
tablet 9f7e54d8b37645c084a46fc6b27ddd34 transaction memory consumption 
(67108557) has exceeded its limit (67108864) or the limit of an ancestral 
tracker
I1201 17:06:01.941339 111261 raft_consensus.cc:1232] T 
9f7e54d8b37645c084a46fc6b27ddd34 P 2d52a82cb62c4064842cf628d01083fe [term 161 
FOLLOWER]: Rejecting Update request from peer c101883d3e82496989a5f9f667c30e38 
for term 161. Could not prepare a single transaction due to: Service 
unavailable: Transaction failed, tablet 9f7e54d8b37645c084a46fc6b27ddd34 
transaction memory consumption (67108557) has exceeded its limit (67108864) or 
the limit of an ancestral tracker
{code}
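
The rejection itself is just an over-limit check on the tablet's transaction 
memory tracker. Here is a minimal sketch of that kind of check (a hypothetical 
tracker, not the actual MemTracker API), plugging in the numbers from the log 
above:

{code}
// Sketch of an over-limit check (not the actual MemTracker implementation):
// a new transaction is rejected when accepting it would push tracked
// consumption past the limit.
#include <cstdint>
#include <cstdio>

struct TxnMemTracker {
  int64_t consumption;
  int64_t limit;

  bool TryConsume(int64_t bytes) {
    if (consumption + bytes > limit) {
      return false;  // caller maps this to Service unavailable / CANNOT_PREPARE
    }
    consumption += bytes;
    return true;
  }
};

int main() {
  // Numbers from the follower's log: 67108557 bytes consumed, 64 MB limit.
  TxnMemTracker tracker{67108557, 67108864};
  // Even a ~1 KB transaction no longer fits, so every incoming op from the
  // leader is rejected and the replica answers CANNOT_PREPARE.
  printf("accepted: %d\n", tracker.TryConsume(1024));  // prints "accepted: 0"
  return 0;
}
{code}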

the leader is logging:
{code}
W1201 17:06:34.255569 182530 wire_protocol.cc:150] Unknown error code in 
status: 
W1201 17:06:34.255590 182530 consensus_peers.cc:328] T 
9f7e54d8b37645c084a46fc6b27ddd34 P c101883d3e82496989a5f9f667c30e38 -> Peer 
2d52a82cb62c4064842cf628d01083fe (ve1120.halxg.cloudera.com:7050): Couldn't 
send request to peer 2d52a82cb62c4064842cf628d01083fe for tablet 
9f7e54d8b37645c084a46fc6b27ddd34. Status: Runtime error: (unknown error code). 
Retrying in the next heartbeat period. Already trie
{code}
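
That "Unknown error code in status" warning suggests the leader does not 
recognize the application-level error code carried in the response, so it 
falls back to a generic runtime error and loses the real cause. A rough sketch 
of that failure mode (made-up names, not the actual wire_protocol.cc code):

{code}
// Sketch (not the actual wire_protocol.cc code) of how a status code the
// receiver doesn't recognize degenerates into a generic runtime error,
// which hides the real CANNOT_PREPARE / Service unavailable cause.
#include <string>

enum class WireErrorCode { OK, NOT_FOUND, SERVICE_UNAVAILABLE /*, ...*/ };

struct Status {
  std::string msg;
  static Status RuntimeError(const std::string& m) {
    return Status{"Runtime error: " + m};
  }
  static Status ServiceUnavailable(const std::string& m) {
    return Status{"Service unavailable: " + m};
  }
};

Status StatusFromWire(int code, const std::string& msg) {
  switch (static_cast<WireErrorCode>(code)) {
    case WireErrorCode::OK:                  return Status{"OK"};
    case WireErrorCode::SERVICE_UNAVAILABLE: return Status::ServiceUnavailable(msg);
    // ... other known codes ...
    default:
      // An error code added on the sender side but unknown here ends up as
      // "(unknown error code)", which is what the leader is logging.
      return Status::RuntimeError("(unknown error code)");
  }
}
{code}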

So it seems like there are a few things going wrong here:
- when the follower responds with 'CANNOT_PREPARE', the leader should still 
take that follower's 'last_received' index into account when updating the 
majority-replicated watermark. Note here that the majority 'last_received' 
index is 174779, yet the queue watermark is still calculated as 174762 (a 
worked example follows this list).
- the leader does not understand the error response being sent back, so it 
logs that odd 'Unknown error code in status' message. This may contribute to 
the above
- we seem to be lacking appropriate test coverage of cases where the 
transaction tracker gets full
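
For the first point, here is a worked example (a hypothetical helper, not the 
real consensus queue code) of computing the majority-replicated watermark from 
the 'last_received' values in the queue dump above; counting the followers' 
acknowledgements even on CANNOT_PREPARE responses would put the watermark at 
174779 rather than 174762:

{code}
// Worked example (hypothetical helper, not the real consensus queue code):
// the majority-replicated watermark is the N-th highest 'last_received'
// index, where N is the majority size (2 of 3 here).
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

int64_t MajorityReplicatedIndex(std::vector<int64_t> last_received, int majority_size) {
  // Sort descending; the watermark is the smallest index that at least
  // 'majority_size' peers have received.
  std::sort(last_received.begin(), last_received.end(), std::greater<int64_t>());
  return last_received[majority_size - 1];
}

int main() {
  // last_received from the queue dump: leader 174786, followers 174779 and 174776.
  std::vector<int64_t> last_received = {174786, 174779, 174776};
  printf("%lld\n", (long long)MajorityReplicatedIndex(last_received, 2));  // 174779
  return 0;
}
{code}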


> Consensus "stuck" with all transaction trackers at the limit
> ------------------------------------------------------------
>
>                 Key: KUDU-1779
>                 URL: https://issues.apache.org/jira/browse/KUDU-1779
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.1.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> In a stress cluster, I saw one tablet get "stuck" in the following state:
> - the transaction_tracker on all three replicas is "full" (no more can be 
> submitted)
> - leader elections proceed just fine, but no leader is able to advance the 
> commit index
> The issue seems to be that a replica will respond with 'CANNOT_PREPARE' when 
> its transaction tracker is full. The leader then ignores this response, and 
> doesn't advance the majority-replicated watermark. The transaction tracker 
> stays full forever because the in-flight transactions can't get committed.
> Notes to follow.


