[ 
https://issues.apache.org/jira/browse/IGNITE-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517301#comment-17517301
 ] 

Alexander Lapin commented on IGNITE-16723:
------------------------------------------

[~Sergey Uttsel] Looks good, thanks.

> TX Recovery protocol in Cockroach in case of a failure of enlisted leaseholder
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-16723
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16723
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Sergey Uttsel
>            Assignee: Sergey Uttsel
>            Priority: Major
>              Labels: ignite-3
>         Attachments: readlock_lost.jpg, writelock_and_tx_record_lost.jpg
>
>
> *Transaction recovery* is closely related to concurrency control.
> Concurrency control is described in 
> [https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#lock-free-distributed-transactions]
>  (especially the chapter “Transaction interactions”) and 
> [https://dl.acm.org/doi/pdf/10.1145/3318464.3386134].
>  
> *Read locks.*
> Most reads in CRDB don’t take any locks. Instead, each read records its 
> timestamp in the timestamp cache.
> The timestamp cache is a bounded in-memory cache that records the maximum 
> timestamp at which key ranges were read from and written to. It corresponds 
> to the "status oracle" discussed in Yabandeh's "A Critique of Snapshot Isolation".
> The cache is updated after the completion of each read operation with the 
> range of all keys that the request was predicated upon. It is then consulted 
> for each write operation, allowing them to detect read-write violations that 
> would allow them to write "under" a read that has already been performed.
> The cache is size-limited, so to prevent read-write conflicts for arbitrarily 
> old requests, it pessimistically maintains a “low water mark”. This value 
> always ratchets with monotonic increases and is equivalent to the earliest 
> timestamp of any key range that is present in the cache. If a write operation 
> writes to a key not present in the cache, the “low water mark” is consulted 
> instead to determine read-write conflicts. The low water mark is initialized 
> to the current system time plus the maximum clock offset.
> When the lease changes, the new leaseholder receives a timestamp cache 
> snapshot summarizing the reads served on the range by prior leaseholders. 
> The new leaseholder can use it to ensure that no future writes are 
> allowed to invalidate prior reads. If a summary is not provided, for example 
> after a leaseholder failure, the new leaseholder pessimistically assumes that 
> prior leaseholders served reads all the way up to the start of the new lease.
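
The timestamp cache behavior quoted above can be illustrated with a minimal Python sketch. It is not CockroachDB code: the class `TimestampCache`, the function `cache_for_new_lease`, the per-key (rather than per-range) bookkeeping, integer timestamps, and the eviction policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TimestampCache:
    """Toy read timestamp cache with a low water mark (hypothetical names)."""
    low_water: int            # floor used for keys that are not in the cache
    max_entries: int = 2      # tiny bound so eviction is easy to observe
    reads: dict = field(default_factory=dict)  # key -> max read timestamp

    def record_read(self, key: str, ts: int) -> None:
        # Keep only the maximum read timestamp seen for the key.
        self.reads[key] = max(ts, self.reads.get(key, self.low_water))
        while len(self.reads) > self.max_entries:
            # Assumed policy: evict the oldest entry and ratchet the low
            # water mark up to it, so the evicted read stays protected.
            victim = min(self.reads, key=self.reads.get)
            self.low_water = max(self.low_water, self.reads.pop(victim))

    def max_read_ts(self, key: str) -> int:
        # Keys missing from the cache fall back to the low water mark.
        return max(self.reads.get(key, self.low_water), self.low_water)

    def write_allowed(self, key: str, write_ts: int) -> bool:
        # A write at or below the latest read would write "under" that read.
        return write_ts > self.max_read_ts(key)


def cache_for_new_lease(lease_start: int,
                        summary: Optional[TimestampCache] = None) -> TimestampCache:
    """Build the cache a new leaseholder starts with."""
    if summary is None:
        # No summary (e.g. the old leaseholder failed): pessimistically
        # assume reads were served up to the start of the new lease.
        return TimestampCache(low_water=lease_start)
    # Summary available: carry over what the prior leaseholder recorded.
    return TimestampCache(low_water=summary.low_water,
                          max_entries=summary.max_entries,
                          reads=dict(summary.reads))
```

In this sketch a write whose timestamp is not above the recorded read timestamp is simply rejected; what the writer does next (for example, retrying at a higher timestamp) is outside its scope.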
>  
> Some reads, such as SELECT FOR UPDATE, take read locks, but these locks are 
> local to the leaseholder and are lost if it fails. In this case a 
> “SELECT FOR UPDATE” request falls back to a regular “SELECT”.
> A range lock also uses a timestamp cache: 
> {code:java}
> Add(start, end roachpb.Key, ts hlc.Timestamp, txnID uuid.UUID){code}
>  
> *Write locks.*
> CockroachDB has distributed write locks - write intents. An intent is a 
> regular MVCC KV pair, except that it is preceded by metadata indicating that 
> what follows is an intent. This metadata points to a transaction record, 
> which is a special key (unique per transaction) that stores the current 
> disposition of the transaction: pending, staging, committed or aborted. 
> Because write intents and tx records are replicated, they persist even after 
> the leaseholder fails.
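
A rough Python sketch of how a reader might follow an intent to its transaction record is given below. The names `TxnStatus`, `Intent`, `UnresolvedIntent` and `read_through_intent` are hypothetical, and the handling chosen for each status is an assumption for illustration, not a description of CockroachDB's actual conflict resolution. The plain dict stands in for replicated storage, which is why the lookup would still work after a leaseholder failure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TxnStatus(Enum):
    PENDING = "pending"
    STAGING = "staging"
    COMMITTED = "committed"
    ABORTED = "aborted"


@dataclass
class Intent:
    key: str
    value: str
    txn_id: str  # metadata pointing at the transaction record


class UnresolvedIntent(Exception):
    """The owning transaction has not reached a final state yet."""


def read_through_intent(intent: Intent,
                        previous_value: Optional[str],
                        txn_records: dict) -> Optional[str]:
    # Follow the intent's metadata to the transaction record.
    status = txn_records[intent.txn_id]
    if status is TxnStatus.COMMITTED:
        return intent.value        # assumed: the intent is the real value
    if status is TxnStatus.ABORTED:
        return previous_value      # assumed: the intent is ignored
    # PENDING or STAGING: the outcome is not known yet, so the caller
    # has to handle the conflict (wait, push, retry, ...).
    raise UnresolvedIntent(f"{intent.key} is held by txn {intent.txn_id}")
```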
>  
> There is a topic discussing this example on the CockroachDB forum: 
> [https://forum.cockroachlabs.com/t/read-write-tx-conflicts-on-leaseholder-failover/5213]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
