[
https://issues.apache.org/jira/browse/IGNITE-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Uttsel updated IGNITE-16723:
-----------------------------------
Description:
*Transaction recovery* is closely related to concurrency control.
Concurrency control is described in
[https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
(especially the chapter “Transaction interactions”) and
[https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].
*Read locks.*
Most reads in CRDB don’t take any locks. Instead it puts a read timestamp to
timestamp cache.
Timestamp cache is a bounded in-memory cache that records the maximum timestamp
that key ranges were read from and written to. Cache corresponds to the "status
oracle" discussed in Yabandeh's A Critique of Snapshot Isolation.
The cache is updated after the completion of each read operation with the range
of all keys that the request was predicated upon. It is then consulted for each
write operation, allowing them to detect read-write violations that would allow
them to write "under" a read that has already been performed.
The cache is size-limited, so to prevent read-write conflicts for arbitrarily
old requests, it pessimistically maintains a “low water mark”. This value
always ratchets with monotonic increases and is equivalent to the earliest
timestamp of any key range that is present in the cache. If a write operation
writes to a key not present in the cache, the “low water mark” is consulted
instead to determine read-write conflicts. The low water mark is initialized to
the current system time plus the maximum clock offset.
On lease changing a timestamp cache snapshot is accepted on a new leaseholder
with a summary of the reads served on the range by prior leaseholders. This can
be used by the new leaseholder to ensure that no future writes are allowed to
invalidate prior reads. If a summary is not provided, for example after a
leaseholder failure, the method pessimistically assumes that prior leaseholders
served reads all the way up to the start of the new lease.
Some reads, like SELECT FOR UPDATE take read locks, but it is local and will be
lost on leaseholder failure. In this case a “SELECT FOR UPDATE” request falls
back to a regular “SELECT”.
A key and a value of the timestamp cache is structures:
{code:java}
key {startKey, endKey}
value {timestamp, txnID}
{code}
A range lock also uses a timestamp cache:
{code:java}
Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
*Write locks.*
CockroachDB has distributed write locks - write intents. An intent is a regular
MVCC KV pair, except that it is preceded by metadata indicating that what
follows is an intent. This metadata points to a transaction record, which is a
special key (unique per transaction) that stores the current disposition of the
transaction: pending, staging, committed or aborted. Because write intents and
tx records are replicated, they persist even after the leaseholder falls.
Theare is a topic with a discussion of this example on the CocroachDB forum:
[https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]
was:
*Transaction recovery* is closely related to concurrency control.
Concurrency control is described in
[https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
(especially the chapter “Transaction interactions”) and
[https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].
*Read locks.*
Most reads in CRDB don’t take any locks. Instead it puts a read timestamp to
timestamp cache.
Timestamp cache is a bounded in-memory cache that records the maximum timestamp
that key ranges were read from and written to. Cache corresponds to the "status
oracle" discussed in Yabandeh's A Critique of Snapshot Isolation.
The cache is updated after the completion of each read operation with the range
of all keys that the request was predicated upon. It is then consulted for each
write operation, allowing them to detect read-write violations that would allow
them to write "under" a read that has already been performed.
The cache is size-limited, so to prevent read-write conflicts for arbitrarily
old requests, it pessimistically maintains a “low water mark”. This value
always ratchets with monotonic increases and is equivalent to the earliest
timestamp of any key range that is present in the cache. If a write operation
writes to a key not present in the cache, the “low water mark” is consulted
instead to determine read-write conflicts. The low water mark is initialized to
the current system time plus the maximum clock offset.
On lease changing a timestamp cache snapshot is accepted on a new leaseholder
with a summary of the reads served on the range by prior leaseholders. This can
be used by the new leaseholder to ensure that no future writes are allowed to
invalidate prior reads. If a summary is not provided, for example after a
leaseholder failure, the method pessimistically assumes that prior leaseholders
served reads all the way up to the start of the new lease.
Some reads, like SELECT FOR UPDATE take read locks, but it is local and will be
lost on leaseholder failure. In this case a “SELECT FOR UPDATE” request falls
back to a regular “SELECT”.
A key and a value of the timestamp cache is structures:
{code:java}
key {startKey, endKey}
value {timestamp, txnID}
{code}
A range lock also uses a timestamp cache:
{code:java}
Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
*Write locks.*
CockroachDB has distributed write locks - write intents. An intent is a regular
MVCC KV pair, except that it is preceded by metadata indicating that what
follows is an intent. This metadata points to a transaction record, which is a
special key (unique per transaction) that stores the current disposition of the
transaction: pending, staging, committed or aborted. Because write intents and
tx records are replicated, they persist even after the leaseholder falls.
Theare is a topic with a discussion of this example on the CocroachDB forum:
[https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]
> TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder
> ------------------------------------------------------------------------------
>
> Key: IGNITE-16723
> URL: https://issues.apache.org/jira/browse/IGNITE-16723
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Uttsel
> Assignee: Sergey Uttsel
> Priority: Major
> Labels: ignite-3
> Attachments: readlock_lost.jpg, writelock_and_tx_record_lost.jpg
>
>
> *Transaction recovery* is closely related to concurrency control.
> Concurrency control is described in
> [https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
> (especially the chapter “Transaction interactions”) and
> [https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].
>
> *Read locks.*
> Most reads in CRDB don’t take any locks. Instead it puts a read timestamp to
> timestamp cache.
> Timestamp cache is a bounded in-memory cache that records the maximum
> timestamp that key ranges were read from and written to. Cache corresponds to
> the "status oracle" discussed in Yabandeh's A Critique of Snapshot Isolation.
> The cache is updated after the completion of each read operation with the
> range of all keys that the request was predicated upon. It is then consulted
> for each write operation, allowing them to detect read-write violations that
> would allow them to write "under" a read that has already been performed.
> The cache is size-limited, so to prevent read-write conflicts for arbitrarily
> old requests, it pessimistically maintains a “low water mark”. This value
> always ratchets with monotonic increases and is equivalent to the earliest
> timestamp of any key range that is present in the cache. If a write operation
> writes to a key not present in the cache, the “low water mark” is consulted
> instead to determine read-write conflicts. The low water mark is initialized
> to the current system time plus the maximum clock offset.
> On lease changing a timestamp cache snapshot is accepted on a new leaseholder
> with a summary of the reads served on the range by prior leaseholders. This
> can be used by the new leaseholder to ensure that no future writes are
> allowed to invalidate prior reads. If a summary is not provided, for example
> after a leaseholder failure, the method pessimistically assumes that prior
> leaseholders served reads all the way up to the start of the new lease.
>
> Some reads, like SELECT FOR UPDATE take read locks, but it is local and will
> be lost on leaseholder failure. In this case a “SELECT FOR UPDATE” request
> falls back to a regular “SELECT”.
>
> A key and a value of the timestamp cache is structures:
> {code:java}
> key {startKey, endKey}
> value {timestamp, txnID}
> {code}
> A range lock also uses a timestamp cache:
> {code:java}
> Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
>
> *Write locks.*
> CockroachDB has distributed write locks - write intents. An intent is a
> regular MVCC KV pair, except that it is preceded by metadata indicating that
> what follows is an intent. This metadata points to a transaction record,
> which is a special key (unique per transaction) that stores the current
> disposition of the transaction: pending, staging, committed or aborted.
> Because write intents and tx records are replicated, they persist even after
> the leaseholder falls.
>
> Theare is a topic with a discussion of this example on the CocroachDB forum:
> [https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]
--
This message was sent by Atlassian Jira
(v8.20.1#820001)