Hello Tidy Bot, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#10).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
21 files changed, 335 insertions(+), 270 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/10
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 10
Gerrit-Owner: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

Reply via email to