[ 
https://issues.apache.org/jira/browse/IGNITE-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Uttsel updated IGNITE-16723:
-----------------------------------
    Description: 
*Transaction recovery* is closely related to concurrency control.

Concurrency control is described in 
[https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
 (especially the chapter “Transaction interactions”) and 
[https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].

 

*Read locks.*

Most reads in CRDB don’t take any locks. Instead, a read records its timestamp 
in the timestamp cache.

The timestamp cache is a bounded in-memory cache that records the maximum 
timestamp at which key ranges were read from and written to. The cache 
corresponds to the "status oracle" discussed in Yabandeh's A Critique of 
Snapshot Isolation.

The cache is updated after the completion of each read operation with the range 
of all keys that the request was predicated upon. It is then consulted for each 
write operation, so that a write can detect a read-write violation, that is, an 
attempt to write "under" a read that has already been performed.

The cache is size-limited, so to prevent read-write conflicts for arbitrarily 
old requests, it pessimistically maintains a “low water mark”. This value only 
ever increases (it ratchets monotonically) and is equivalent to the earliest 
timestamp of any key range that is present in the cache. If a write operation 
writes to a key not present in the cache, the “low water mark” is consulted 
instead to determine read-write conflicts. The low water mark is initialized to 
the current system time plus the maximum clock offset.
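
The rules above can be sketched as a toy cache. This is only an illustration 
under stated assumptions, not CockroachDB's implementation: it tracks single 
keys instead of key ranges, uses an integer in place of hlc.Timestamp, and all 
names (TimestampCache, RecordRead, CanWriteAt) are hypothetical. It only 
reports the conflict; how a conflicting write is handled is out of scope here.

```go
package main

import "fmt"

// Timestamp is a simplified logical timestamp (assumption: stands in for hlc.Timestamp).
type Timestamp int64

// TimestampCache is a hypothetical per-key model of a bounded read timestamp cache.
type TimestampCache struct {
	capacity int                  // maximum number of tracked keys (assumed > 0)
	lowWater Timestamp            // pessimistic bound for keys that are not tracked
	reads    map[string]Timestamp // maximum read timestamp per tracked key
}

func NewTimestampCache(capacity int, lowWater Timestamp) *TimestampCache {
	return &TimestampCache{capacity: capacity, lowWater: lowWater, reads: make(map[string]Timestamp)}
}

// RecordRead is called after a read of key at ts has completed.
func (c *TimestampCache) RecordRead(key string, ts Timestamp) {
	if ts <= c.lowWater {
		return // already covered by the low water mark
	}
	if cur, ok := c.reads[key]; ok {
		if ts > cur {
			c.reads[key] = ts
		}
		return
	}
	if len(c.reads) >= c.capacity {
		c.evictOldest()
	}
	c.reads[key] = ts
}

// evictOldest drops the entry with the smallest timestamp and ratchets the
// low water mark up to it, so the evicted read stays protected.
func (c *TimestampCache) evictOldest() {
	var oldestKey string
	var oldestTS Timestamp
	first := true
	for k, ts := range c.reads {
		if first || ts < oldestTS {
			oldestKey, oldestTS, first = k, ts, false
		}
	}
	delete(c.reads, oldestKey)
	if oldestTS > c.lowWater {
		c.lowWater = oldestTS
	}
}

// MaxReadTimestamp returns the tracked read timestamp, or the low water mark
// if the key is not present in the cache.
func (c *TimestampCache) MaxReadTimestamp(key string) Timestamp {
	if ts, ok := c.reads[key]; ok {
		return ts
	}
	return c.lowWater
}

// CanWriteAt reports whether a write at ts would NOT write "under" a prior read.
func (c *TimestampCache) CanWriteAt(key string, ts Timestamp) bool {
	return ts > c.MaxReadTimestamp(key)
}

func main() {
	c := NewTimestampCache(2, 100)
	c.RecordRead("k1", 150)
	fmt.Println(c.CanWriteAt("k1", 140)) // false: would write under the read at 150
	fmt.Println(c.CanWriteAt("k1", 160)) // true
	fmt.Println(c.CanWriteAt("k9", 90))  // false: untracked key, below the low water mark
}
```

Evicting an entry raises the low water mark, so the sketch may report false 
conflicts for untracked keys, but it never misses a real one.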

When the lease changes, the new leaseholder accepts a timestamp cache snapshot: 
a summary of the reads served on the range by prior leaseholders. The new 
leaseholder can use it to ensure that no future writes are allowed to 
invalidate prior reads. If a summary is not provided, for example after a 
leaseholder failure, the new leaseholder pessimistically assumes that prior 
leaseholders served reads all the way up to the start of the new lease.
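
A minimal sketch of that fallback, with several assumptions: ReadSummary, 
PriorReadBound and CanWriteAfterLeaseChange are hypothetical names, the summary 
is keyed by single keys rather than key ranges, and timestamps are plain 
integers.

```go
package main

import "fmt"

// Timestamp is a simplified logical timestamp (assumption: stands in for hlc.Timestamp).
type Timestamp int64

// ReadSummary is a hypothetical summary of the reads served by prior leaseholders.
type ReadSummary struct {
	LowWater Timestamp            // bound for keys not listed in Keys
	Keys     map[string]Timestamp // maximum read timestamp per key
}

// PriorReadBound returns the timestamp up to which the new leaseholder must
// assume that key was read by prior leaseholders.
func PriorReadBound(key string, leaseStart Timestamp, s *ReadSummary) Timestamp {
	if s == nil {
		// No summary (e.g. the old leaseholder failed): assume reads were
		// served all the way up to the start of the new lease.
		return leaseStart
	}
	if ts, ok := s.Keys[key]; ok {
		return ts
	}
	return s.LowWater
}

// CanWriteAfterLeaseChange reports whether a write at writeTS cannot
// invalidate a read served by a prior leaseholder.
func CanWriteAfterLeaseChange(key string, writeTS, leaseStart Timestamp, s *ReadSummary) bool {
	return writeTS > PriorReadBound(key, leaseStart, s)
}

func main() {
	summary := &ReadSummary{LowWater: 100, Keys: map[string]Timestamp{"k1": 180}}
	// With a summary, only the recorded reads constrain the write.
	fmt.Println(CanWriteAfterLeaseChange("k2", 150, 200, summary))
	// Without a summary, every write at or below the lease start is treated as a conflict.
	fmt.Println(CanWriteAfterLeaseChange("k2", 150, 200, nil))
}
```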

 

Some reads, such as SELECT FOR UPDATE, take read locks, but these locks are 
local to the leaseholder and are lost on leaseholder failure. In this case a 
“SELECT FOR UPDATE” request falls back to a regular “SELECT”.

A range lock also uses a timestamp cache: 
{code:java}
Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
 

*Write locks.*

CockroachDB has distributed write locks, called write intents. An intent is a 
regular MVCC KV pair, except that it is preceded by metadata indicating that 
what follows is an intent. This metadata points to a transaction record, which 
is a special key (unique per transaction) that stores the current disposition 
of the transaction: pending, staging, committed or aborted. Because write 
intents and tx records are replicated, they persist even after the leaseholder 
fails.
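
The relationship between an intent and its transaction record can be sketched 
as follows. This is a simplified model, not CockroachDB's data layout: the 
types Intent and TxnRecord, the IntentAction values and the function ActionFor 
are hypothetical, and a real system has more options for a pending or staging 
transaction than simply waiting.

```go
package main

import "fmt"

// TxnStatus is the disposition stored in the transaction record.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Staging
	Committed
	Aborted
)

// TxnRecord is a hypothetical model of the per-transaction record.
type TxnRecord struct {
	ID     string
	Status TxnStatus
}

// Intent is a hypothetical model of a write intent: a provisional KV pair
// whose metadata points at the transaction record via TxnID.
type Intent struct {
	Key, Value string
	TxnID      string
}

// IntentAction is what a reader or writer that encounters the intent may do.
type IntentAction int

const (
	Wait          IntentAction = iota // outcome of the transaction is not known yet
	ResolveCommit                     // turn the intent into a regular committed value
	ResolveAbort                      // discard the provisional value
)

// ActionFor decides how to treat an intent based on its transaction record.
func ActionFor(rec TxnRecord) IntentAction {
	switch rec.Status {
	case Committed:
		return ResolveCommit
	case Aborted:
		return ResolveAbort
	default:
		// Pending or Staging: simplified to waiting in this sketch.
		return Wait
	}
}

func main() {
	records := map[string]TxnRecord{"tx1": {ID: "tx1", Status: Committed}}
	intent := Intent{Key: "k2", Value: "v1", TxnID: "tx1"}
	// Both the intent and the record are replicated, so this lookup still
	// works after the leaseholder has changed.
	fmt.Println(ActionFor(records[intent.TxnID]) == ResolveCommit)
}
```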

 

There is a topic on the CockroachDB forum discussing an example of a read-write 
conflict on leaseholder failover: 
[https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]

  was:
Need to investigate:
 # how it is guaranteed that the commit timestamp falls within the leaseholder 
intervals of all enlisted partitions if a leaseholder enlisted in the txn 
fails.
 # how to restore a lock taken by a read operation if a leaseholder enlisted in 
the txn fails.

 

For example:

tx1.start

v1 = tx1.read(k1) in range1

tx1.write(k2, v1) in range2

Start to commit tx1 (the write intent has been replicated and the commit 
timestamp is known, but the commit request has not been sent yet)

Leaseholder of range1 has failed, the lease has expired. A new leaseholder was 
elected.

tx2.start

tx2.write(k1, v2) in range1

tx2.commit

tx2.end

Transaction record of tx1 is committed now.

tx1.end

 

I started a topic with a discussion of this example on the CockroachDB forum

https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213


> TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-16723
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16723
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Sergey Uttsel
>            Assignee: Sergey Uttsel
>            Priority: Major
>              Labels: ignite-3
>         Attachments: readlock_lost.jpg, writelock_and_tx_record_lost.jpg
>
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
