[ 
https://issues.apache.org/jira/browse/IGNITE-17263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619665#comment-17619665
 ] 

Denis Chudov commented on IGNITE-17263:
---------------------------------------

The safe time sync command used for idle safe time propagation is a write 
command, which can cause problems on an idle cluster: the raft log keeps 
growing even when there are no updates. This increases the join time of a node 
that was offline for some time, and forces it to load a snapshot of the whole 
raft storage if the log has been compacted. We can put up with this problem 
for now, but in the future we can think about some optimizations.

Possibly we can come up with some heuristics that depend on the read-only load 
on an idle cluster. For example, most idle safe time propagation commands 
could be triggered by read-only requests.
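
One possible shape of such a heuristic, as a rough sketch only: a sync command 
is issued when a read-only request cannot be served yet and no sync was issued 
recently. The class name, the min_interval parameter and the use of plain 
integers for HLC timestamps are all assumptions made for illustration, not 
existing Ignite code.

```python
class IdleSyncTrigger:
    """Hypothetical helper: decides when to issue a safe time sync write command."""

    def __init__(self, min_interval):
        # Minimum spacing between two sync commands, in timestamp units (assumed).
        self.min_interval = min_interval
        self.last_sync_ts = None

    def on_read_only_request(self, read_ts, safe_time, now):
        """Return True if a sync command should be issued for this request."""
        if safe_time >= read_ts:
            # The safe time already covers the read timestamp: no sync needed.
            return False
        if self.last_sync_ts is not None and now - self.last_sync_ts < self.min_interval:
            # A sync was issued recently: throttle instead of appending another entry.
            return False
        self.last_sync_ts = now
        return True


trigger = IdleSyncTrigger(min_interval=10)
print(trigger.on_read_only_request(read_ts=50, safe_time=40, now=50))  # True
print(trigger.on_read_only_request(read_ts=52, safe_time=40, now=52))  # False
```

With a trigger like this, a cluster that receives neither writes nor read-only 
requests appends nothing to the raft log. The cost is that the first read-only 
request after an idle period has to wait for a sync round.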

When we implement lease-based primary replicas, we will be able to propagate 
safe time via messaging, as the lease mechanism will guarantee that no primary 
replica can send, as safe time, an HLC value that is less than the safe time 
sent by the previous primary replica.


> Implement leader to replica safe time propagation
> -------------------------------------------------
>
>                 Key: IGNITE-17263
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17263
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexander Lapin
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3, transaction3_ro
>         Attachments: Screenshot from 2022-07-06 16-48-30.png, Screenshot from 
> 2022-07-06 16-48-41.png
>
>
> In order to perform replica reads, it's required either to use read index or 
> to check the safe time. Let's recall the corresponding section from the tx 
> design document.
> RO transactions can be executed on non-primary replicas. Write intent 
> resolution doesn’t help because a write intent for a committed transaction 
> may not yet be replicated to the replica. To mitigate this issue, it’s enough 
> to run readIndex on each mapped partition leader, fetch the commit index and 
> wait on a replica until it’s applied. This will guarantee that all required 
> write intents are replicated and present locally. After that the normal write 
> intent resolution should do the job.
> There is a second option, which doesn’t require the network RTT. We can use a 
> special low watermark timestamp (safeTs) per replication group, which 
> corresponds to the apply index of a replicated entry, so when the apply index 
> is advanced during replication, the safeTs is monotonically 
> incremented too. The HLC used for safeTs advancing is assigned to a 
> replicated entry in an ordered way.
> Special measures are needed to periodically advance the safeTs if no updates 
> are happening. It’s enough to use a special replication command for this 
> purpose.
> All we need during RO txn is to wait until a safeTs advances past the RO txn 
> readTs. 
>  !Screenshot from 2022-07-06 16-48-30.png! 
> In the picture we have two concurrent transactions mapped to the same 
> partition: T1 and T2.
> OpReq(w1(x)) and OpReq(w2(x)) are received concurrently. Each write intent is 
> assigned a timestamp in a monotonic order consistent with the replication 
> order. This can be done, for example, when replication entries are dequeued 
> for processing by the replication protocol (we assume entries are replicated 
> successively).
> It’s not enough just to wait for safeTs to advance: it may never happen due 
> to the absence of activity in the partition. Consider the next diagram:
>  !Screenshot from 2022-07-06 16-48-41.png! 
> We need an additional safeTsSync command to propagate a safeTs event in case 
> there are no updates in the partition.
> We need to linearize safe time updates in all cases, including leader 
> change. So we need a guarantee that safe time on non-primary replicas will 
> never be greater than the HLC on the leader (as we assume that the primary 
> replica is colocated with the leader). We are going to solve this problem by 
> associating every potential value of safeTime (propagated to the replica from 
> the leader via appendEntries) with some log index, and this value (safe time 
> candidate) should be applied as the new safe time value at the moment when 
> the corresponding index is committed.
> Hence, the safeTimeSyncCommand should also be a Raft write command.
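
The candidate mechanism from the description above can be sketched roughly as 
follows. This is a minimal illustration, not the actual implementation: 
SafeTimeTracker and its method names are hypothetical, HLC timestamps are 
modeled as plain integers, and treating "advances past readTs" as 
safe_time >= read_ts is an assumption. The on_truncate step is also an 
assumption; it reflects the standard Raft behavior of replacing an uncommitted 
log suffix after a leader change.

```python
class SafeTimeTracker:
    """Hypothetical replica-side safe time state for one replication group."""

    def __init__(self):
        self.safe_time = 0        # last applied safe time (HLC modeled as an int)
        self._candidates = {}     # log index -> safe time candidate

    def on_append_entries(self, log_index, candidate_ts):
        # The leader associated candidate_ts with the entry at log_index.
        # It is only a candidate until that index is committed.
        self._candidates[log_index] = candidate_ts

    def on_commit(self, commit_index):
        # Apply, in log order, every candidate whose index is now committed.
        for index in sorted(i for i in self._candidates if i <= commit_index):
            self.safe_time = max(self.safe_time, self._candidates.pop(index))

    def on_truncate(self, from_index):
        # Assumed step: an uncommitted suffix was replaced after a leader
        # change, so its candidates must never become the safe time.
        for index in [i for i in self._candidates if i >= from_index]:
            del self._candidates[index]

    def can_read(self, read_ts):
        # An RO transaction may read once the safe time has reached its readTs.
        return self.safe_time >= read_ts


tracker = SafeTimeTracker()
tracker.on_append_entries(1, 100)   # ordinary write entry
tracker.on_append_entries(2, 105)   # idle safeTsSync entry
tracker.on_commit(1)
print(tracker.safe_time, tracker.can_read(103))   # 100 False
tracker.on_commit(2)
print(tracker.safe_time, tracker.can_read(103))   # 105 True
```

Because a candidate becomes visible only when its index is committed, a value 
proposed by a leader that later loses leadership is never observed as safe 
time. That is the reason the sync command has to go through the log.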



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
