[
https://issues.apache.org/jira/browse/IGNITE-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Uttsel updated IGNITE-16723:
-----------------------------------
Attachment: writelock_and_tx_record_lost.jpg
> TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder
> ------------------------------------------------------------------------------
>
> Key: IGNITE-16723
> URL: https://issues.apache.org/jira/browse/IGNITE-16723
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Uttsel
> Assignee: Sergey Uttsel
> Priority: Major
> Labels: ignite-3
> Attachments: readlock_lost.jpg, writelock_and_tx_record_lost.jpg
>
>
> *Transaction recovery* is closely related to concurrency control.
> Concurrency control is described in
> [https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
> (especially the chapter “Transaction interactions”) and
> [https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].
>
> *Read locks.*
> Most reads in CRDB don’t take any locks. Instead, each read records its
> timestamp in the timestamp cache.
> The timestamp cache is a bounded in-memory cache that records the maximum
> timestamp at which key ranges were read from and written to. It corresponds
> to the "status oracle" discussed in Yabandeh's A Critique of Snapshot Isolation.
> The cache is updated after the completion of each read operation with the
> range of all keys that the request was predicated upon. It is then consulted
> for each write operation, allowing them to detect read-write violations that
> would allow them to write "under" a read that has already been performed.
> The cache is size-limited, so to prevent read-write conflicts for arbitrarily
> old requests, it pessimistically maintains a “low water mark”. This value
> always ratchets with monotonic increases and is equivalent to the earliest
> timestamp of any key range that is present in the cache. If a write operation
> writes to a key not present in the cache, the “low water mark” is consulted
> instead to determine read-write conflicts. The low water mark is initialized
> to the current system time plus the maximum clock offset.
> On a lease change, the new leaseholder accepts a timestamp cache snapshot
> containing a summary of the reads served on the range by prior leaseholders.
> The new leaseholder can use this summary to ensure that no future writes are
> allowed to invalidate prior reads. If a summary is not provided, for example
> after a leaseholder failure, the new leaseholder pessimistically assumes that
> prior leaseholders served reads all the way up to the start of the new lease.
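>
> The timestamp cache behaviour described above can be illustrated with a
> minimal sketch. It is not CockroachDB code: the names Timestamp, ReadSummary,
> TimestampCache and NewLeaseholderCache are hypothetical, keys are tracked
> individually rather than as ranges, and the low water mark is simply raised
> to the timestamp of each evicted entry. It only shows how the low water mark
> answers for keys missing from the cache and how a new leaseholder starts
> either from a transferred summary or, pessimistically, from the lease start.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an HLC timestamp (hypothetical type).
type Timestamp int64

// ReadSummary is a hypothetical snapshot handed over on a lease change.
type ReadSummary struct {
	LowWater Timestamp
	Reads    map[string]Timestamp
}

// TimestampCache is a toy per-key cache; the real one tracks key ranges.
type TimestampCache struct {
	maxEntries int
	lowWater   Timestamp
	reads      map[string]Timestamp
}

// NewTimestampCache starts the low water mark at now + maxOffset.
func NewTimestampCache(maxEntries int, now, maxOffset Timestamp) *TimestampCache {
	return &TimestampCache{maxEntries: maxEntries, lowWater: now + maxOffset, reads: map[string]Timestamp{}}
}

// RecordRead notes that key was read at ts and evicts old entries if needed.
func (c *TimestampCache) RecordRead(key string, ts Timestamp) {
	if ts <= c.lowWater || ts <= c.reads[key] {
		return // already covered by the low water mark or a later read
	}
	c.reads[key] = ts
	for len(c.reads) > c.maxEntries && len(c.reads) > 0 {
		c.evictOldest()
	}
}

// evictOldest drops the oldest entry and ratchets the low water mark up to it.
func (c *TimestampCache) evictOldest() {
	oldestKey, oldest, found := "", Timestamp(0), false
	for k, ts := range c.reads {
		if !found || ts < oldest {
			oldestKey, oldest, found = k, ts, true
		}
	}
	delete(c.reads, oldestKey)
	if oldest > c.lowWater {
		c.lowWater = oldest
	}
}

// MaxRead returns the latest timestamp at which key may have been read.
func (c *TimestampCache) MaxRead(key string) Timestamp {
	if ts, ok := c.reads[key]; ok {
		return ts
	}
	return c.lowWater // key not cached: fall back to the low water mark
}

// ConflictsWithRead reports whether a write at ts would land "under" a read.
func (c *TimestampCache) ConflictsWithRead(key string, ts Timestamp) bool {
	return ts <= c.MaxRead(key)
}

// Summary snapshots the reads served so far, for transfer on a lease change.
func (c *TimestampCache) Summary() *ReadSummary {
	s := &ReadSummary{LowWater: c.lowWater, Reads: map[string]Timestamp{}}
	for k, ts := range c.reads {
		s.Reads[k] = ts
	}
	return s
}

// NewLeaseholderCache builds the cache for a lease starting at leaseStart.
func NewLeaseholderCache(maxEntries int, leaseStart Timestamp, s *ReadSummary) *TimestampCache {
	c := &TimestampCache{maxEntries: maxEntries, lowWater: leaseStart, reads: map[string]Timestamp{}}
	if s == nil {
		return c // no summary: assume reads were served up to the lease start
	}
	c.lowWater = s.LowWater
	for k, ts := range s.Reads {
		c.RecordRead(k, ts)
	}
	return c
}

func main() {
	old := NewTimestampCache(2, 100, 5)
	for i, k := range []string{"a", "b", "c"} {
		old.RecordRead(k, Timestamp(110+10*i)) // the third read evicts "a"
	}
	fmt.Println(old.ConflictsWithRead("a", 108), old.ConflictsWithRead("b", 125))
	fmt.Println(NewLeaseholderCache(2, 200, old.Summary()).MaxRead("x"))
	fmt.Println(NewLeaseholderCache(2, 200, nil).MaxRead("x"))
}
```

> In this sketch a cache rebuilt without a summary treats every key as read at
> the lease start, so any write below that timestamp is reported as a conflict.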
>
> Some reads, such as SELECT FOR UPDATE, take read locks, but these locks are
> local to the leaseholder and are lost if it fails. In this case a
> “SELECT FOR UPDATE” request falls back to a regular “SELECT”.
> A range lock also uses a timestamp cache:
> {code:go}
> Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
>
> *Write locks.*
> CockroachDB has distributed write locks - write intents. An intent is a
> regular MVCC KV pair, except that it is preceded by metadata indicating that
> what follows is an intent. This metadata points to a transaction record,
> which is a special key (unique per transaction) that stores the current
> disposition of the transaction: pending, staging, committed or aborted.
> Because write intents and tx records are replicated, they persist even after
> the leaseholder fails.
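>
> The relationship between an intent and its transaction record can be
> sketched as follows. This is an illustrative model, not CockroachDB code:
> Store, Intent, TxnStatus and ErrUnresolved are hypothetical names, a single
> map stands in for replicated storage, and a missing record is treated as
> pending, which is an assumption of the sketch. A reader that finds an intent
> follows its transaction ID to the record and decides from the recorded
> disposition which value is visible.

```go
package main

import (
	"errors"
	"fmt"
)

// TxnStatus is the disposition stored in a transaction record.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Staging
	Committed
	Aborted
)

// Intent is a provisional value plus a pointer (TxnID) to its record.
type Intent struct {
	TxnID string
	Value string
}

// Store is a toy model of one range: committed values, intents and records.
type Store struct {
	committed map[string]string
	intents   map[string]Intent
	records   map[string]TxnStatus
}

// ErrUnresolved signals that the intent's transaction has not finished.
var ErrUnresolved = errors.New("intent belongs to an unfinished transaction")

// Read returns the visible value of key, consulting the transaction record
// whenever an intent is found on the key.
func (s *Store) Read(key string) (string, bool, error) {
	in, ok := s.intents[key]
	if !ok {
		v, found := s.committed[key]
		return v, found, nil
	}
	// A missing record yields the zero value Pending (sketch assumption).
	switch s.records[in.TxnID] {
	case Committed:
		return in.Value, true, nil // the provisional value is now visible
	case Aborted:
		v, found := s.committed[key]
		return v, found, nil // ignore the intent, use the older value
	default:
		return "", false, ErrUnresolved // pending or staging: cannot decide
	}
}

func main() {
	s := &Store{
		committed: map[string]string{"k": "old"},
		intents:   map[string]Intent{"k": {TxnID: "tx1", Value: "new"}},
		records:   map[string]TxnStatus{"tx1": Pending},
	}
	_, _, err := s.Read("k")
	fmt.Println("pending:", err)

	s.records["tx1"] = Committed
	v, _, _ := s.Read("k")
	fmt.Println("committed:", v)
}
```

> Because both maps model replicated state, nothing in this lookup depends on
> which node currently holds the lease.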
>
> There is a topic discussing this example on the CockroachDB forum:
> [https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]