Hello Mike Percy,

I'd like you to do a code review.  Please visit


to review the following change.

Change subject: WIP: cleanup/refactoring in consensus

WIP: cleanup/refactoring in consensus

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
ConsensusQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. This required a fairly
invasive surgery because the PeerMessageQueue itself doesn't remember
the terms of pending operations, and thus the watermarks became indexes
instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

WIP because there are a few TODOs yet to iron out, although it seems
like this is at least passing most of the stress tests in
raft-consensus-itest. There does seem to be one more lurking issue,
though, because in a dist-test run, a couple of tests timedout.

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
14 files changed, 386 insertions(+), 440 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/1
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>

Reply via email to