[ https://issues.apache.org/jira/browse/IGNITE-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517301#comment-17517301 ]
Alexander Lapin commented on IGNITE-16723:
------------------------------------------

[~Sergey Uttsel] Looks good, thanks.

> TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-16723
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16723
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Sergey Uttsel
>            Assignee: Sergey Uttsel
>            Priority: Major
>              Labels: ignite-3
>         Attachments: readlock_lost.jpg, writelock_and_tx_record_lost.jpg
>
>
> *Transaction recovery* is closely related to concurrency control.
> Concurrency control is described in
> [https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
> (especially the chapter “Transaction interactions”) and
> [https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].
>
> *Read locks.*
> Most reads in CRDB don’t take any locks. Instead, each read records its
> timestamp in the timestamp cache.
> The timestamp cache is a bounded in-memory cache that records the maximum
> timestamp at which key ranges were read from and written to. The cache
> corresponds to the "status oracle" discussed in Yabandeh's A Critique of
> Snapshot Isolation.
> The cache is updated after the completion of each read operation with the
> range of all keys that the request was predicated upon. It is then consulted
> for each write operation, allowing writes to detect read-write violations
> that would let them write "under" a read that has already been performed.
> The cache is size-limited, so to prevent read-write conflicts for arbitrarily
> old requests, it pessimistically maintains a “low water mark”. This value
> only ever ratchets upward and is equivalent to the earliest timestamp of any
> key range that is present in the cache. If a write operation writes to a key
> not present in the cache, the “low water mark” is consulted instead to
> determine read-write conflicts. The low water mark is initialized to the
> current system time plus the maximum clock offset.
> When the lease changes, the new leaseholder accepts a timestamp cache
> snapshot that summarizes the reads served on the range by prior
> leaseholders. The new leaseholder can use it to ensure that no future writes
> are allowed to invalidate prior reads. If a summary is not provided, for
> example after a leaseholder failure, the new leaseholder pessimistically
> assumes that prior leaseholders served reads all the way up to the start of
> the new lease.
>
> Some reads, such as SELECT FOR UPDATE, take read locks, but these locks are
> local to the leaseholder and are lost on leaseholder failure. In this case a
> “SELECT FOR UPDATE” request falls back to a regular “SELECT”.
> A range lock also uses the timestamp cache:
> {code:java}
> Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
>
> *Write locks.*
> CockroachDB has distributed write locks, called write intents. An intent is
> a regular MVCC KV pair, except that it is preceded by metadata indicating
> that what follows is an intent. This metadata points to a transaction
> record, which is a special key (unique per transaction) that stores the
> current disposition of the transaction: pending, staging, committed or
> aborted.
> Because write intents and tx records are replicated, they persist even after
> the leaseholder fails.
>
> There is a topic with a discussion of this example on the CockroachDB forum:
> [https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]
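
The timestamp cache behaviour described under *Read locks* can be sketched roughly as below. This is a simplified illustration, not CockroachDB's implementation: it tracks single keys instead of key ranges, uses a plain integer instead of hlc.Timestamp, ignores transaction IDs, and all names (TimestampCache, NewTimestampCache, GetMax, CanWriteAt) are hypothetical.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for a hybrid logical clock timestamp.
type Timestamp int64

// TimestampCache is a hypothetical, single-key version of a timestamp cache.
type TimestampCache struct {
	lowWater Timestamp            // pessimistic bound for untracked keys
	maxSize  int                  // maximum number of tracked keys
	reads    map[string]Timestamp // latest read timestamp per tracked key
}

func NewTimestampCache(lowWater Timestamp, maxSize int) *TimestampCache {
	return &TimestampCache{lowWater: lowWater, maxSize: maxSize, reads: map[string]Timestamp{}}
}

// Add records that key was read at ts.
func (c *TimestampCache) Add(key string, ts Timestamp) {
	if ts <= c.lowWater {
		return // already covered by the low water mark
	}
	if cur, ok := c.reads[key]; ok {
		if ts > cur {
			c.reads[key] = ts
		}
		return
	}
	if len(c.reads) >= c.maxSize {
		c.evictOldest()
		if ts <= c.lowWater {
			return // eviction raised the low water mark past this read
		}
	}
	c.reads[key] = ts
}

// evictOldest drops the entry with the smallest timestamp and ratchets the
// low water mark up to it, so the evicted read stays protected.
func (c *TimestampCache) evictOldest() {
	found := false
	var minKey string
	var minTS Timestamp
	for k, ts := range c.reads {
		if !found || ts < minTS {
			minKey, minTS, found = k, ts, true
		}
	}
	if !found {
		return
	}
	delete(c.reads, minKey)
	if minTS > c.lowWater {
		c.lowWater = minTS
	}
}

// GetMax returns the latest read timestamp known for key, falling back to
// the low water mark when the key is not tracked.
func (c *TimestampCache) GetMax(key string) Timestamp {
	if ts, ok := c.reads[key]; ok && ts > c.lowWater {
		return ts
	}
	return c.lowWater
}

// CanWriteAt reports whether a write at ts would not land "under" a read.
func (c *TimestampCache) CanWriteAt(key string, ts Timestamp) bool {
	return ts > c.GetMax(key)
}

func main() {
	c := NewTimestampCache(100, 2) // low water mark and size are arbitrary
	c.Add("a", 110)
	c.Add("b", 120)
	fmt.Println(c.CanWriteAt("a", 105)) // false: would write under the read at 110
	fmt.Println(c.CanWriteAt("a", 115)) // true
	c.Add("c", 130)                     // evicts "a", low water mark becomes 110
	fmt.Println(c.CanWriteAt("z", 105)) // false: "z" is untracked, bound is 110
}
```

In this model, a new leaseholder that received no summary after a failure would simply start from NewTimestampCache(leaseStart, n), so every write at or below the lease start timestamp is treated as conflicting with a possible earlier read.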