Igniters, As you probably know, we are working on MVCC feature [1]. When MVCC is enabled every cache operation/tx require one unique version for read and possibly one version for write. Versions are assigned by a single node called *coordinator*.
Requirements for version: 1) Must be increasing and comparable, so that we understand which operation occurred earlier. 2) Support coordinator failtover 3) As compact as possible Currently we implemented naive approach where version is represented as two long values (16 bytes total): 1) Coordinator version (a kind of mix of a timestamp and topology version) 2) Local coordinator counter. Other vendors typically assign 4 byte (long) or 8 byte (long) versions, which is definitely better as they are faster to compare and require less space, what could be especially important for small values, such as indexes. I thought a bit, and came to conclusion that we could fit into 64 bit version (8 bytes) as follows: [0..47] - local coordinator counter; this gives capacity 10^14, which should be enough for any sensible workload [48..60] - coordinator version, which is incremented on every coordinator change (relatively rare) or local counter wraparound (very rare); this gives 8192 versions [61] - coordinator epoch [62] - delete marker, set on entry removal, cleared on resurrection (e.g. TX rollback, or index change A->B->A) [63] - freeze marker Mechanics: 1) Every coordinator change increments coordinator version and share it via discovery 2) If persistence is enabled, coordinator version is persisted and possibly WAL logged 3) Epoch is switched between 0 and 1 on every coordinator version wraparound (extremely rare event) 4) Epoch is cluster-wide properly, every node knows current epoch 5) As epoch switches, nodes start marking entries from previous epoch as "frozen". This is done in a background in a very light-weight mode, as there is no rush 6) Epoch 0 could not be switched to 1 if there are non-frozen entries from previous epoch 1, and vice versa. This way version comparison is always possible. IMO this should do the trick so that we have 8 byte version with virtually zero overhead in runtime. But a little distributed magic would be needed. Thoughts? P.S.: Idea is borrowed from PostgreSQL counter wraparound handling. [1] https://issues.apache.org/jira/browse/IGNITE-3478