Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit


to look at the new patch set (#17).

Change subject: Simplify OpId/Timestamp assignment and make it atomic

Simplify OpId/Timestamp assignment and make it atomic


Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.


There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 350 insertions(+), 275 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/17
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

Reply via email to