[
https://issues.apache.org/jira/browse/IGNITE-17263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
yexiaowei updated IGNITE-17263:
-------------------------------
Attachment: 20240409180024.jpg
> Implement leader to replica safe time propagation
> -------------------------------------------------
>
> Key: IGNITE-17263
> URL: https://issues.apache.org/jira/browse/IGNITE-17263
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexander Lapin
> Assignee: Denis Chudov
> Priority: Major
> Labels: ignite-3, transaction3_ro
> Fix For: 3.0.0-beta1
>
> Attachments: 20240409180024.jpg, Screenshot from 2022-07-06
> 16-48-30.png, Screenshot from 2022-07-06 16-48-41.png
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> To perform replica reads, a replica must either use read index or check the
> safe time. Let's recall the corresponding section of the tx design document.
> RO transactions can be executed on non-primary replicas. Write intent
> resolution doesn’t help because a write intent for a committed transaction
> may not yet be replicated to the replica. To mitigate this issue, it’s enough
> to run readIndex on each mapped partition leader, fetch the commit index and
> wait on a replica until it’s applied. This will guarantee that all required
> write intents are replicated and present locally. After that the normal write
> intent resolution should do the job.
> There is a second option, which doesn’t require the network RTT. We can use a
> special low watermark timestamp (safeTs) per replication group, which
> corresponds to the apply index of a replicated entry, so when the apply index
> is advanced during replication, the safeTs is monotonically incremented too.
> The HLC used for advancing safeTs is assigned to a replicated entry in an
> ordered way.
> Special measures are needed to periodically advance the safeTs if no updates
> are happening. It’s enough to use a special replication command for this
> purpose.
> All we need during an RO txn is to wait until the safeTs advances past the
> RO txn readTs.
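
The wait described above can be sketched roughly as follows. This is a minimal illustration, not Ignite code: the class SafeTimeTracker and its methods are hypothetical, timestamps are modeled as plain long values rather than HLC timestamps, and the sketch assumes that safeTs >= readTs is sufficient for the read to proceed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

/** Hypothetical per-replication-group safe time tracker (illustrative names only). */
class SafeTimeTracker {
    /** Current safeTs of the replication group. */
    private long safeTs;

    /** Pending RO reads keyed by the readTs they are waiting for. */
    private final TreeMap<Long, CompletableFuture<Void>> waiters = new TreeMap<>();

    /** Returns a future that completes once safeTs has reached readTs (assumed condition: safeTs >= readTs). */
    synchronized CompletableFuture<Void> waitFor(long readTs) {
        if (safeTs >= readTs) {
            return CompletableFuture.completedFuture(null);
        }
        return waiters.computeIfAbsent(readTs, k -> new CompletableFuture<>());
    }

    /** Advances safeTs monotonically and releases every waiter whose readTs is now covered. */
    void advance(long newSafeTs) {
        List<CompletableFuture<Void>> ready = new ArrayList<>();
        synchronized (this) {
            if (newSafeTs <= safeTs) {
                return; // safeTs never moves backwards
            }
            safeTs = newSafeTs;
            Map<Long, CompletableFuture<Void>> covered = waiters.headMap(newSafeTs, true);
            ready.addAll(covered.values());
            covered.clear();
        }
        // Complete outside the lock so waiter callbacks cannot block the tracker.
        ready.forEach(f -> f.complete(null));
    }

    synchronized long current() {
        return safeTs;
    }
}
```

An RO read on a replica would call waitFor(readTs) and run write intent resolution only after the returned future completes.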
> !Screenshot from 2022-07-06 16-48-30.png!
> In the picture we have two concurrent transactions mapped to the same
> partition: T1 and T2.
> OpReq(w1(x)) and OpReq(w2(x)) are received concurrently. Each write intent is
> assigned a timestamp in a monotonic order consistent with the replication
> order. This can be done, for example, when replication entries are dequeued
> for processing by the replication protocol (we assume entries are replicated
> successively).
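
One way to picture the ordered assignment is the sketch below. It is an assumption-laden simplification: MonotonicTimestampAssigner is a hypothetical name, and a single long stands in for an HLC value (a real HLC carries a physical and a logical component). The only property it demonstrates is that timestamps handed out at dequeue time strictly increase in dequeue order, even if the physical clock stalls or steps back.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper that stamps entries in the order they are dequeued for replication. */
class MonotonicTimestampAssigner {
    /** Source of physical time; injected so the behavior is easy to check. */
    private final LongSupplier physicalClock;

    /** Last timestamp handed out. */
    private long last;

    MonotonicTimestampAssigner(LongSupplier physicalClock) {
        this.physicalClock = physicalClock;
    }

    /**
     * Called once per entry at dequeue time. Synchronization makes the
     * assignment order the same as the order in which calls are serialized.
     */
    synchronized long next() {
        // Take the physical time when it is ahead, otherwise bump the last value by one.
        last = Math.max(physicalClock.getAsLong(), last + 1);
        return last;
    }
}
```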
> It’s not enough just to wait for safeTs to advance: that may never happen
> due to the absence of activity in the partition. Consider the next diagram:
> !Screenshot from 2022-07-06 16-48-41.png!
> We need an additional safeTsSync command to propagate a safeTs event in case
> there are no updates in the partition.
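
A rough sketch of the idle check on the leader is shown below. The class IdleSafeTimeSync, its methods, and the notion of a fixed idle interval are all assumptions made for illustration; the ticket does not specify how often the sync command is issued.

```java
/** Hypothetical leader-side helper deciding when an idle partition needs a safeTsSync command. */
class IdleSafeTimeSync {
    /** Assumed tuning knob: how long the partition may stay silent before a sync is sent. */
    private final long idleIntervalMs;

    /** Time at which the last entry (update or sync) was replicated. */
    private long lastReplicatedAtMs;

    IdleSafeTimeSync(long idleIntervalMs, long startMs) {
        this.idleIntervalMs = idleIntervalMs;
        this.lastReplicatedAtMs = startMs;
    }

    /** Regular updates already carry a timestamp, so they reset the idle timer. */
    synchronized void onEntryReplicated(long nowMs) {
        lastReplicatedAtMs = nowMs;
    }

    /** Called periodically; returns true when the caller should replicate a safeTsSync command. */
    synchronized boolean shouldSendSync(long nowMs) {
        if (nowMs - lastReplicatedAtMs < idleIntervalMs) {
            return false;
        }
        // The sync command is itself a replicated entry, so it resets the timer too.
        lastReplicatedAtMs = nowMs;
        return true;
    }
}
```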
> We need to linearize safe time updates in all cases, including leader
> change. So we need a guarantee that safe time on non-primary replicas will
> never be greater than the HLC on the leader (as we assume that the primary
> replica is colocated with the leader). We are going to solve this problem by
> associating every potential value of safeTime (propagated to the replica from
> the leader via appendEntries) with some log index, and this value (safe time
> candidate) should be applied as the new safe time value at the moment when
> the corresponding index is committed.
> Hence, the safeTimeSyncCommand should also be a Raft write command.
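
The candidate mechanism can be sketched on the replica side as follows. All names are hypothetical and timestamps are plain long values. The onTruncate hook is an added assumption: it models Raft discarding an uncommitted log suffix after a leader change, so that candidates from entries that never commit are never applied.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical replica-side holder of safe time candidates keyed by Raft log index. */
class SafeTimeCandidates {
    /** Candidate safe time values received via appendEntries, not yet committed. */
    private final TreeMap<Long, Long> candidatesByIndex = new TreeMap<>();

    /** Safe time currently visible to RO reads on this replica. */
    private long safeTime;

    /** Remembers the candidate carried by the entry appended at logIndex. */
    synchronized void onAppend(long logIndex, long candidateTs) {
        candidatesByIndex.put(logIndex, candidateTs);
    }

    /** Applies every candidate whose log index is now committed. */
    synchronized void onCommit(long committedIndex) {
        NavigableMap<Long, Long> committed = candidatesByIndex.headMap(committedIndex, true);
        for (long ts : committed.values()) {
            safeTime = Math.max(safeTime, ts); // safe time stays monotonic
        }
        committed.clear();
    }

    /** Assumption: drops candidates of entries removed from the log, e.g. after a leader change. */
    synchronized void onTruncate(long fromIndex) {
        candidatesByIndex.tailMap(fromIndex, true).clear();
    }

    synchronized long safeTime() {
        return safeTime;
    }
}
```

Because a candidate becomes visible only on commit, a value proposed by a deposed leader in an entry that is later truncated never reaches RO reads.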
--
This message was sent by Atlassian Jira
(v8.20.10#820010)