MVCC version format

Vladimir Ozerov Fri, 15 Dec 2017 01:57:36 -0800

Igniters,

As you probably know, we are working on MVCC feature [1]. When MVCC is
enabled every cache operation/tx require one unique version for read and
possibly one version for write. Versions are assigned by a single node
called *coordinator*.


Requirements for version:
1) Must be increasing and comparable, so that we understand which operation
occurred earlier.
2) Support coordinator failtover
3) As compact as possible

Currently we implemented naive approach where version is represented as two
long values (16 bytes total):
1) Coordinator version (a kind of mix of a timestamp and topology version)
2) Local coordinator counter.

Other vendors typically assign 4 byte (long) or 8 byte (long) versions,
which is definitely better as they are faster to compare and require less
space, what could be especially important for small values, such as indexes.

I thought a bit, and came to conclusion that we could fit into 64 bit
version (8 bytes) as follows:
[0..47] - local coordinator counter; this gives capacity 10^14, which
should be enough for any sensible workload
[48..60] - coordinator version, which is incremented on every coordinator
change (relatively rare) or local counter wraparound (very rare); this
gives 8192 versions
[61] - coordinator epoch
[62] - delete marker, set on entry removal, cleared on resurrection (e.g.
TX rollback, or index change A->B->A)
[63] - freeze marker

Mechanics:
1) Every coordinator change increments coordinator version and share it via
discovery
2) If persistence is enabled, coordinator version is persisted and possibly
WAL logged
3) Epoch is switched between 0 and 1 on every coordinator version
wraparound (extremely rare event)
4) Epoch is cluster-wide properly, every node knows current epoch
5) As epoch switches, nodes start marking entries from previous epoch as
"frozen". This is done in a background in a very light-weight mode, as
there is no rush
6) Epoch 0 could not be switched to 1 if there are non-frozen entries from
previous epoch 1, and vice versa. This way version comparison is always
possible.

IMO this should do the trick so that we have 8 byte version with virtually
zero overhead in runtime. But a little distributed magic would be needed.

Thoughts?

P.S.: Idea is borrowed from PostgreSQL counter wraparound handling.

[1] https://issues.apache.org/jira/browse/IGNITE-3478

MVCC version format

Reply via email to