Serge, Cool feature!
I have following questions: 1) MVCC will be a kind of "cache mode"? And we can have caches with old behavior and caches with MVCC? 2) " Garbage collection" - may be we should give another name to not intersect with JVM GC? Thanks! On Tue, Aug 15, 2017 at 8:11 PM, Serge Puchnin n <sergey.puch...@gmail.com> wrote: > Hello Ignite Developers, > > I’d like to start a discussion about the design of MVCC implementation > [1]. > It’ll help us to solve an issue when a user might see a partially > committed transaction. > > Below you can find the proposed design. > Please provide your feedback/thoughts about suggested solution. > > Thanks a lot, > Sergey Puchnin > > [1] https://issues.apache.org/jira/browse/IGNITE-3478 > > > *Multi-Version Concurrency Control Architecture* > > > Abstract > > This page contains high-level description of MVCC Architecture for JIRA > https://issues.apache.org/jira/browse/IGNITE-3478 > Problem Description > > Current Ignite SQL does not take into account transaction boundaries. For > example, if a transaction atomically changes the balance of two accounts, > then a concurrent SQL query can see a partially committed transaction. This > works for data analytics but does not work for more strict and demanding > scenarios. > It would be perfect if we had a mode which ensures transaction boundaries > are taken into account for SQL query. > Design Description > > Multi-Version Concurrency Control (MVCC) is a concurrency control > mechanism that has been implemented in numerous RDBMS systems. > The main approach of this solution is instead of change data in-place a > session creates a new version of data, and different transactions are able > to get a correct version in concurrent surroundings. > Consequently, readers never block writers, writers never block readers, > readers > don't need any locks. > > In cluster environment data are well distributed via different partitions, > nodes and it's necessary to support cases when read-write transactions are > processed by nodes in an accidental order. > > At this point, we have to revise the only transactions that need to update > or read data on more than one partition. > > To provide cross-cache ACID transactions for all isolation levels and lock > models in distributed environment it necessary to determinate what changes > were made before a transaction was started and are visible for the > transaction. > > To solve that new component is provided in the system - TS coordinator and > two new attributes are added to an entity: minTS and maxTS. > The coordinator will order all changes for multi-partition transactions. > minTS and maxTS will determine a visibility of specific version for a > transaction. > > A base workflow is include next steps. > 1. Initialize > > During initialization, a transaction asks a node that acts as a TS > coordinator to get a new transaction identifier (currentTX). The currentTX > is a monotonic, unique identifier to order all changes. > The coordinator adds current transaction into an active transaction list > in RUNNING state. > Also, the coordinator informs the transaction about all transactions were > registered before its and now are in RUNNING state (excluding TX). > 2. Writing > > On Prepare stage the transaction get all necessary locks. > On Commit stage the transaction: > 1. To find current version in PK Index (minTS = currentTX or maxTS = 0) > 2. To find current version in Data Page by a link from #1 > 3. Insert new version into Data Page and set minTS = currentTX, maxTS=0 > 4. Insert new version into PK Index set minTS = currentTX set maxTS=0 and > set link to #2 > 5. Update current version in Data Page (find via link from #1) set maxTS = > currentTX > 6. Update current version in PK Index Page (find at #1) set maxTS = > currentTX > 7. Find current version in Secondary Index (minTS = currentTX or maxTS = 0) > 8. Insert new version into Secondary Index set minTS = currentTX set > maxTS=0 set link to #4 > 9. Update current version in Secondary Index set maxTS=currentTX > > Due to Two-Phase Commit it's not possible to get a case when two not > committed transactions update the same key. > For delete statements steps 3, 4, 8 should be skipped. > 3. Reading > > For a transaction with Repeatable Read, a value will be cached on Near > Node. > For other levels, a transaction should find a version for which the > following condition holds: > minTS >= CurrentTS and (maxTS < CurrentTS or maxTS = 0). Also minTS, maxTS > should not be in excluding TX list. > For Read Committed, CurrentTS is an identifier on start statement moment. > For Serialization level, CurrentTS is an identifier on start session > moment. In addition to that if a session has found that maxTS !=0 an error > "It's not possible to serialize the result" should be raised. > 4. Garbage collection > > During a Garbage collection procedure any entity's versions with maxTS > less than minimal currentTX for sessions with RUNNING state can be deleted. > There are no cases when we should show these values. minTS for next version > should be updated to 0. > 5. Sample > > As a sample, there are three write operation and check changes in data > structures. > > > > > > -- Alexey Kuznetsov