Todd Lipcon has uploaded a new patch set (#2).
Change subject: WIP: cleanup/refactoring in consensus
WIP: cleanup/refactoring in consensus
This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.
The key item here is to address a long-standing TODO that the
ConsensusQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:
- updates peer last_received
- calls PeerMessageQueue::AdvanceQueueWatermark
- updates PeerMessageQueue::majority_replicated to the new majority
- calls NotifyObserversOfMajorityReplOp
--> submits an async task to the raft threadpool
- for each observer (best I can tell there is currently always
only one, but that's a separate issue)
-- observer->UpdateMajorityReplicated() to grab committed index
(the observer is always the RaftConsensus instance)
----- this checks the condition that the majority-replicated
watermark is within the leader's term before advancing the
----- calls AdvanceCommittedIndexUnlocked
------- commits stuff, etc
----- sets *committed_index out-param
-- PeerMessageQueue picks up this out-param and sets the new
value within the PeerMessageQueue
This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter.
The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. This required a fairly
invasive surgery because the PeerMessageQueue itself doesn't remember
the terms of pending operations, and thus the watermarks became indexes
instead of OpIds.
This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.
WIP because there are a few TODOs yet to iron out, although it seems
like this is at least passing most of the stress tests in
raft-consensus-itest. There does seem to be one more lurking issue,
though, because in a dist-test run, a couple of tests timedout.
14 files changed, 387 insertions(+), 461 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/2
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>