[
https://issues.apache.org/jira/browse/IGNITE-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Uttsel updated IGNITE-16723:
-----------------------------------
Description:
*Transaction recovery* is closely related to concurrency control.
Concurrency control is described in
[https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
(especially the chapter “Transaction interactions”) and
[https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].
*Read locks.*
Most reads in CRDB don’t take any locks. Instead, the read timestamp is recorded
in the timestamp cache.
The timestamp cache is a bounded in-memory cache that records the maximum
timestamp at which key ranges were read from and written to. The cache corresponds
to the "status oracle" discussed in Yabandeh's A Critique of Snapshot Isolation.
The cache is updated after the completion of each read operation with the range
of all keys that the request was predicated upon. It is then consulted for each
write operation, allowing writes to detect read-write violations that would
otherwise let them write "under" a read that has already been performed.
The cache is size-limited, so to prevent read-write conflicts for arbitrarily
old requests, it pessimistically maintains a “low water mark”. This value
always ratchets with monotonic increases and is equivalent to the earliest
timestamp of any key range that is present in the cache. If a write operation
writes to a key not present in the cache, the “low water mark” is consulted
instead to determine read-write conflicts. The low water mark is initialized to
the current system time plus the maximum clock offset.
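As a rough illustration of the mechanism described above (a toy sketch, not CockroachDB's actual implementation — the type and method names here are invented, and single string keys stand in for key ranges), a size-limited read-timestamp cache with a ratcheting low water mark could look like this:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp (wall time only).
type Timestamp int64

// tsCache is a toy, size-limited read-timestamp cache.
type tsCache struct {
	maxSize      int
	reads        map[string]Timestamp // key -> max timestamp it was read at
	lowWaterMark Timestamp            // ratchets up as entries are evicted
}

func newTsCache(maxSize int, now, maxClockOffset Timestamp) *tsCache {
	// The low water mark is initialized to current time + max clock offset.
	return &tsCache{
		maxSize:      maxSize,
		reads:        map[string]Timestamp{},
		lowWaterMark: now + maxClockOffset,
	}
}

// recordRead notes that key was read at ts, evicting the oldest entry if full.
func (c *tsCache) recordRead(key string, ts Timestamp) {
	if cur, ok := c.reads[key]; !ok || ts > cur {
		if !ok && len(c.reads) >= c.maxSize {
			c.evictOldest()
		}
		c.reads[key] = ts
	}
}

// evictOldest drops the entry with the earliest timestamp and folds it into
// the low water mark, which only ever increases.
func (c *tsCache) evictOldest() {
	var oldestKey string
	oldest := Timestamp(1 << 62)
	for k, ts := range c.reads {
		if ts < oldest {
			oldest, oldestKey = ts, k
		}
	}
	delete(c.reads, oldestKey)
	if oldest > c.lowWaterMark {
		c.lowWaterMark = oldest
	}
}

// writeAllowedAt reports whether a write to key at ts would go "under" a read.
func (c *tsCache) writeAllowedAt(key string, ts Timestamp) bool {
	if readTs, ok := c.reads[key]; ok {
		return ts > readTs
	}
	// Key not in the cache: fall back to the pessimistic low water mark.
	return ts > c.lowWaterMark
}

func main() {
	c := newTsCache(2, 100, 5) // low water mark starts at 105
	c.recordRead("a", 110)
	fmt.Println(c.writeAllowedAt("a", 105)) // false: would write under the read at 110
	fmt.Println(c.writeAllowedAt("a", 120)) // true
	fmt.Println(c.writeAllowedAt("z", 103)) // false: below the low water mark
}
```

The key property the sketch preserves is that eviction never loses conservatism: a write to an uncached key is checked against the low water mark, which is at least as large as any evicted read timestamp.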
On a lease change, the new leaseholder accepts a timestamp cache snapshot
containing a summary of the reads served on the range by prior leaseholders.
The new leaseholder uses it to ensure that no future writes are allowed to
invalidate prior reads. If a summary is not provided, for example after a
leaseholder failure, the new leaseholder pessimistically assumes that prior
leaseholders served reads all the way up to the start of the new lease.
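The lease-change rule above can be reduced to one decision; the sketch below is an invented illustration (the function name and signature are not CockroachDB's API), showing how a new leaseholder could seed its timestamp cache floor:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp int64

// seedLowWaterMark sketches the choice a new leaseholder makes: if the prior
// leaseholder handed over a read summary, use its high water mark; otherwise
// pessimistically assume reads were served up to the start of the new lease.
func seedLowWaterMark(summaryHighWater *Timestamp, leaseStart Timestamp) Timestamp {
	if summaryHighWater != nil {
		return *summaryHighWater
	}
	return leaseStart // no summary (e.g. leaseholder crashed): be pessimistic
}

func main() {
	prior := Timestamp(90)
	fmt.Println(seedLowWaterMark(&prior, 100)) // 90: summary transferred
	fmt.Println(seedLowWaterMark(nil, 100))    // 100: assume reads up to lease start
}
```

The pessimistic branch is what makes read protection survive leaseholder failure: any write below the new lease start is rejected, even though the actual reads it might conflict with are unknown.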
Some reads, like SELECT FOR UPDATE, take read locks, but these locks are local
and are lost on leaseholder failure. In that case a “SELECT FOR UPDATE” request
falls back to behaving like a regular “SELECT”.
A range lock also uses the timestamp cache:
{code:go}
// Records a read of the key span [start, end) at timestamp ts by transaction txnID.
Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
*Write locks.*
CockroachDB has distributed write locks - write intents. An intent is a regular
MVCC KV pair, except that it is preceded by metadata indicating that what
follows is an intent. This metadata points to a transaction record, which is a
special key (unique per transaction) that stores the current disposition of the
transaction: pending, staging, committed or aborted. Because write intents and
transaction records are replicated, they persist even after the leaseholder fails.
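The relationship between an intent and its transaction record can be sketched as follows; this is a toy model for illustration (the `intent` struct and `resolveIntent` helper are invented, not the real kvserver API), showing how a reader that encounters an intent consults the replicated transaction record:

```go
package main

import "fmt"

// TxnStatus mirrors the dispositions a transaction record can hold.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Staging
	Committed
	Aborted
)

// intent is a toy write intent: a provisional MVCC value plus the ID of the
// replicated transaction record that owns it.
type intent struct {
	key, value string
	txnID      string
}

// resolveIntent sketches intent resolution: surface the value if the owning
// transaction committed, discard it if aborted, or report the outcome as
// still undecided (the reader would have to wait or push the transaction).
func resolveIntent(in intent, records map[string]TxnStatus) (value string, ok bool, undecided bool) {
	switch records[in.txnID] {
	case Committed:
		return in.value, true, false // intent becomes a regular MVCC value
	case Aborted:
		return "", false, false // intent is discarded
	default: // Pending or Staging: the transaction's fate is not yet known
		return "", false, true
	}
}

func main() {
	records := map[string]TxnStatus{"tx1": Committed, "tx2": Aborted, "tx3": Pending}
	v, ok, _ := resolveIntent(intent{"k1", "v1", "tx1"}, records)
	fmt.Println(v, ok) // v1 true
	_, ok, undecided := resolveIntent(intent{"k2", "v2", "tx3"}, records)
	fmt.Println(ok, undecided) // false true
}
```

Because both the intents and the record they point at are replicated through Raft, this resolution still works after the leaseholder fails, which is exactly the property the read-side timestamp cache lacks.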
There is a topic with a discussion of this example on the CockroachDB forum:
[https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]
was:
Need to investigate:
# how it is guaranteed that the commit timestamp is within the leaseholder
intervals of all enlisted partitions in case of a failure of a txn-enlisted
leaseholder.
# how to restore a lock for a read operation in case of a failure of a
txn-enlisted leaseholder.
For example:
tx1.start
v1 = tx1.read(k1) in range1
tx1.write(k2, v1) in range2
Start to commit tx1 (the write intent was replicated, a commit timestamp is
known, but a commit request has not been sent yet)
Leaseholder of range1 has failed, the lease has expired. A new leaseholder was
elected.
tx2.start
tx2.write(k1, v2) in range1
tx2.commit
tx2.end
Transaction record of tx1 is committed now.
tx1.end
I started a topic with a discussion of this example on the CockroachDB forum:
https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213
> TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder
> ------------------------------------------------------------------------------
>
> Key: IGNITE-16723
> URL: https://issues.apache.org/jira/browse/IGNITE-16723
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Uttsel
> Assignee: Sergey Uttsel
> Priority: Major
> Labels: ignite-3
> Attachments: readlock_lost.jpg, writelock_and_tx_record_lost.jpg
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)