[jira] [Updated] (IGNITE-16723) TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder

Sergey Uttsel (Jira) Wed, 06 Apr 2022 02:30:56 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sergey Uttsel updated IGNITE-16723:
-----------------------------------
    Description: 
*Transaction recovery* is closely related to concurrency control.

Concurrency control is described in 
[https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
 (especially the chapter “Transaction interactions”) and 
[https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].

 

*Read locks.*

Most reads in CRDB don’t take any locks. Instead it puts a read timestamp to 
timestamp cache.

Timestamp cache is a bounded in-memory cache that records the maximum timestamp 
that key ranges were read from and written to. Cache corresponds to the "status 
oracle" discussed in Yabandeh's A Critique of Snapshot Isolation.

The cache is updated after the completion of each read operation with the range 
of all keys that the request was predicated upon. It is then consulted for each 
write operation, allowing them to detect read-write violations that would allow 
them to write "under" a read that has already been performed.

The cache is size-limited, so to prevent read-write conflicts for arbitrarily 
old requests, it pessimistically maintains a “low water mark”. This value 
always ratchets with monotonic increases and is equivalent to the earliest 
timestamp of any key range that is present in the cache. If a write operation 
writes to a key not present in the cache, the “low water mark” is consulted 
instead to determine read-write conflicts. The low water mark is initialized to 
the current system time plus the maximum clock offset.

On lease changing a timestamp cache snapshot is accepted on a new leaseholder 
with a summary of the reads served on the range by prior leaseholders. This can 
be used by the new leaseholder to ensure that no future writes are allowed to 
invalidate prior reads. If a summary is not provided, for example after a 
leaseholder failure, the method pessimistically assumes that prior leaseholders 
served reads all the way up to the start of the new lease.

 

Some reads, like SELECT FOR UPDATE take read locks, but it is local and will be 
lost on leaseholder failure. In this case a “SELECT FOR UPDATE” request falls 
back to a regular “SELECT”.

 

A key and a value of the timestamp cache is structures:
{code:java}
key {startKey, endKey}
value {timestamp, txnID}
{code}
A range lock also uses a timestamp cache:
{code:java}
Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
 

*Write locks.*

CockroachDB has distributed write locks - write intents. An intent is a regular 
MVCC KV pair, except that it is preceded by metadata indicating that what 
follows is an intent. This metadata points to a transaction record, which is a 
special key (unique per transaction) that stores the current disposition of the 
transaction: pending, staging, committed or aborted. Because write intents and 
tx records are replicated, they persist even after the leaseholder falls.

 

Theare is a topic with a discussion of this example on the CocroachDB forum: 
[https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]

  was:
*Transaction recovery* is closely related to concurrency control.

Concurrency control is described in 
[https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
 (especially the chapter “Transaction interactions”) and 
[https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].

 

*Read locks.*

Most reads in CRDB don’t take any locks. Instead it puts a read timestamp to 
timestamp cache.

Timestamp cache is a bounded in-memory cache that records the maximum timestamp 
that key ranges were read from and written to. Cache corresponds to the "status 
oracle" discussed in Yabandeh's A Critique of Snapshot Isolation.

The cache is updated after the completion of each read operation with the range 
of all keys that the request was predicated upon. It is then consulted for each 
write operation, allowing them to detect read-write violations that would allow 
them to write "under" a read that has already been performed.

The cache is size-limited, so to prevent read-write conflicts for arbitrarily 
old requests, it pessimistically maintains a “low water mark”. This value 
always ratchets with monotonic increases and is equivalent to the earliest 
timestamp of any key range that is present in the cache. If a write operation 
writes to a key not present in the cache, the “low water mark” is consulted 
instead to determine read-write conflicts. The low water mark is initialized to 
the current system time plus the maximum clock offset.

On lease changing a timestamp cache snapshot is accepted on a new leaseholder 
with a summary of the reads served on the range by prior leaseholders. This can 
be used by the new leaseholder to ensure that no future writes are allowed to 
invalidate prior reads. If a summary is not provided, for example after a 
leaseholder failure, the method pessimistically assumes that prior leaseholders 
served reads all the way up to the start of the new lease.

 

Some reads, like SELECT FOR UPDATE take read locks, but it is local and will be 
lost on leaseholder failure. In this case a “SELECT FOR UPDATE” request falls 
back to a regular “SELECT”.

 

A key and a value of the timestamp cache is structures:

 
{code:java}
key {startKey, endKey}
value {timestamp, txnID}
{code}
A range lock also uses a timestamp cache:
{code:java}
Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
 

*Write locks.*

CockroachDB has distributed write locks - write intents. An intent is a regular 
MVCC KV pair, except that it is preceded by metadata indicating that what 
follows is an intent. This metadata points to a transaction record, which is a 
special key (unique per transaction) that stores the current disposition of the 
transaction: pending, staging, committed or aborted. Because write intents and 
tx records are replicated, they persist even after the leaseholder falls.

 

Theare is a topic with a discussion of this example on the CocroachDB forum: 
[https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]


> TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-16723
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16723
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Sergey Uttsel
>            Assignee: Sergey Uttsel
>            Priority: Major
>              Labels: ignite-3
>         Attachments: readlock_lost.jpg, writelock_and_tx_record_lost.jpg
>
>
> *Transaction recovery* is closely related to concurrency control.
> Concurrency control is described in 
> [https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
>  (especially the chapter “Transaction interactions”) and 
> [https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].
>  
> *Read locks.*
> Most reads in CRDB don’t take any locks. Instead it puts a read timestamp to 
> timestamp cache.
> Timestamp cache is a bounded in-memory cache that records the maximum 
> timestamp that key ranges were read from and written to. Cache corresponds to 
> the "status oracle" discussed in Yabandeh's A Critique of Snapshot Isolation.
> The cache is updated after the completion of each read operation with the 
> range of all keys that the request was predicated upon. It is then consulted 
> for each write operation, allowing them to detect read-write violations that 
> would allow them to write "under" a read that has already been performed.
> The cache is size-limited, so to prevent read-write conflicts for arbitrarily 
> old requests, it pessimistically maintains a “low water mark”. This value 
> always ratchets with monotonic increases and is equivalent to the earliest 
> timestamp of any key range that is present in the cache. If a write operation 
> writes to a key not present in the cache, the “low water mark” is consulted 
> instead to determine read-write conflicts. The low water mark is initialized 
> to the current system time plus the maximum clock offset.
> On lease changing a timestamp cache snapshot is accepted on a new leaseholder 
> with a summary of the reads served on the range by prior leaseholders. This 
> can be used by the new leaseholder to ensure that no future writes are 
> allowed to invalidate prior reads. If a summary is not provided, for example 
> after a leaseholder failure, the method pessimistically assumes that prior 
> leaseholders served reads all the way up to the start of the new lease.
>  
> Some reads, like SELECT FOR UPDATE take read locks, but it is local and will 
> be lost on leaseholder failure. In this case a “SELECT FOR UPDATE” request 
> falls back to a regular “SELECT”.
>  
> A key and a value of the timestamp cache is structures:
> {code:java}
> key {startKey, endKey}
> value {timestamp, txnID}
> {code}
> A range lock also uses a timestamp cache:
> {code:java}
> Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
>  
> *Write locks.*
> CockroachDB has distributed write locks - write intents. An intent is a 
> regular MVCC KV pair, except that it is preceded by metadata indicating that 
> what follows is an intent. This metadata points to a transaction record, 
> which is a special key (unique per transaction) that stores the current 
> disposition of the transaction: pending, staging, committed or aborted. 
> Because write intents and tx records are replicated, they persist even after 
> the leaseholder falls.
>  
> Theare is a topic with a discussion of this example on the CocroachDB forum: 
> [https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Updated] (IGNITE-16723) TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder

Reply via email to