[jira] [Updated] (IGNITE-17263) Implement leader to replica safe time propagation

Alexander Lapin (Jira) Wed, 06 Jul 2022 06:58:05 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-17263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexander Lapin updated IGNITE-17263:
-------------------------------------
    Description: 
To perform replica reads, we must either use the read index or check the safe 
time. Let's recall the corresponding section of the tx design document.

RO transactions can be executed on non-primary replicas. Write intent 
resolution alone doesn’t help, because a write intent for a committed 
transaction may not yet be replicated to the replica. To mitigate this issue, 
it’s enough to run readIndex on each mapped partition leader, fetch the commit 
index and wait on the replica until it’s applied. This guarantees that all 
required write intents are replicated and present locally. After that, the 
normal write intent resolution should do the job.

There is a second option, which doesn’t require the network RTT. We can use a 
special low watermark timestamp (safeTs) per replication group, which 
corresponds to the apply index of a replicated entry: whenever the apply index 
advances during replication, the safeTs is monotonically incremented too. The 
HLC timestamp used to advance the safeTs is assigned to replicated entries in 
an ordered way.

Special measures are needed to periodically advance the safeTs if no updates 
are happening. It’s enough to use a special replication command for this 
purpose.

During an RO txn, all we need is to wait until the safeTs advances past the RO 
txn's readTs. 
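
The wait described above can be sketched roughly as follows. This is only an 
illustration, not Ignite code: the class SafeTimeTracker and its methods 
waitFor, advance and current are hypothetical names, and timestamps are 
simplified to plain long values instead of real HLC timestamps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

/** Hypothetical per-replication-group safeTs holder (illustration only). */
class SafeTimeTracker {
    /** Current safeTs of the group on this replica; guarded by "this". */
    private long safeTs;

    /** RO reads parked until the safeTs moves past their readTs. */
    private final TreeMap<Long, List<CompletableFuture<Void>>> waiters = new TreeMap<>();

    synchronized long current() {
        return safeTs;
    }

    /** Returns a future that completes once safeTs has advanced past readTs. */
    synchronized CompletableFuture<Void> waitFor(long readTs) {
        if (safeTs > readTs) {
            return CompletableFuture.completedFuture(null);
        }
        CompletableFuture<Void> fut = new CompletableFuture<>();
        waiters.computeIfAbsent(readTs, k -> new ArrayList<>()).add(fut);
        return fut;
    }

    /** Called when a replicated entry (or an idle sync) carrying newSafeTs is applied. */
    void advance(long newSafeTs) {
        List<CompletableFuture<Void>> ready = new ArrayList<>();
        synchronized (this) {
            if (newSafeTs <= safeTs) {
                return; // safeTs never moves backwards
            }
            safeTs = newSafeTs;
            // All waiters with readTs strictly below the new safeTs can proceed.
            Map<Long, List<CompletableFuture<Void>>> done = waiters.headMap(newSafeTs, false);
            done.values().forEach(ready::addAll);
            done.clear();
        }
        // Complete outside the lock so that callbacks cannot block the tracker.
        ready.forEach(f -> f.complete(null));
    }
}
```

An RO read with a given readTs would call waitFor(readTs) on the replica and 
run the normal write intent resolution once the returned future completes.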
 !Screenshot from 2022-07-06 16-48-30.png! 
In the picture we have two concurrent transactions mapped to the same 
partition: T1 and T2.
OpReq(w1(x)) and OpReq(w2(x)) are received concurrently. Each write intent is 
assigned a timestamp in a monotonic order consistent with the replication 
order. This can be done, for example, when replication entries are dequeued for 
processing by the replication protocol (we assume entries are replicated 
successively).
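
A minimal sketch of such timestamp assignment at dequeue time is shown below. 
It is an assumption-laden illustration: MonotonicTimestampAssigner is a 
hypothetical name, and a single long stands in for a real HLC value (physical 
plus logical part). The only property it demonstrates is that timestamps grow 
strictly in the order in which entries are dequeued, even if the physical 
clock stalls or steps back.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper that stamps entries in dequeue order (illustration only). */
class MonotonicTimestampAssigner {
    /** Source of physical time, e.g. System::currentTimeMillis. */
    private final LongSupplier physicalClock;

    /** Last timestamp handed out; guarded by "this". */
    private long last;

    MonotonicTimestampAssigner(LongSupplier physicalClock) {
        this.physicalClock = physicalClock;
    }

    /** Called once per entry, in the order entries are dequeued for replication. */
    synchronized long next() {
        // Follow the physical clock, but never repeat or go below the last value.
        last = Math.max(physicalClock.getAsLong(), last + 1);
        return last;
    }
}
```

Because next() is called in replication order, the resulting timestamp order 
matches the replication order, which is what lets the safeTs follow the apply 
index.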

Waiting for the safeTs alone is not enough: it may never advance if there is 
no activity in the partition. Consider the next diagram:
 !Screenshot from 2022-07-06 16-48-41.png! 
We need an additional safeTsSync command to propagate a safeTs event in case 
there are no updates in the partition.

Actually, it seems possible to reuse common raft messages such as 
heartbeatRequests and vote/prevoteRequests, together with 
appendEntriesRequests, to propagate safeTime from the leader to replicas. As 
mentioned in 
[IGNITE-17261|https://issues.apache.org/jira/browse/IGNITE-17261], the txnState 
switch should be linearized with all safe-time propagation requests.
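
The piggybacking idea can be sketched as follows. All names here (Heartbeat, 
LeaderSafeTimeSource, ReplicaSafeTime) are hypothetical and do not correspond 
to actual raft or Ignite classes. The sketch assumes that the leader draws 
heartbeat and entry timestamps from the same monotonic sequence, and that a 
replica processes messages from the leader in the order they were sent.

```java
import java.util.function.LongSupplier;

/** Hypothetical heartbeat payload carrying the leader's safe time. */
class Heartbeat {
    final long safeTs;

    Heartbeat(long safeTs) {
        this.safeTs = safeTs;
    }
}

/** Hypothetical leader side: one monotonic sequence for entries and heartbeats. */
class LeaderSafeTimeSource {
    private final LongSupplier physicalClock;
    private long last;

    LeaderSafeTimeSource(LongSupplier physicalClock) {
        this.physicalClock = physicalClock;
    }

    /** Timestamp for the next replicated entry. */
    synchronized long nextTimestamp() {
        last = Math.max(physicalClock.getAsLong(), last + 1);
        return last;
    }

    /** With no updates in the partition, the heartbeat still moves safe time forward. */
    Heartbeat heartbeat() {
        // Any entry stamped after this call gets a strictly larger timestamp.
        return new Heartbeat(nextTimestamp());
    }
}

/** Hypothetical replica side: safeTs only ever grows. */
class ReplicaSafeTime {
    private long safeTs;

    synchronized void onEntryApplied(long entryTs) {
        safeTs = Math.max(safeTs, entryTs);
    }

    synchronized void onHeartbeat(Heartbeat hb) {
        safeTs = Math.max(safeTs, hb.safeTs);
    }

    synchronized long safeTs() {
        return safeTs;
    }
}
```

The ordering assumption matters: if a heartbeat could overtake an earlier 
entry, a replica might advance its safeTs before that entry's write intent is 
present locally.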

  was:
To perform replica reads, we must either use the read index or check the safe 
time. Let's recall the corresponding section of the tx design document.

RO transactions can be executed on non-primary replicas. Write intent 
resolution alone doesn’t help, because a write intent for a committed 
transaction may not yet be replicated to the replica. To mitigate this issue, 
it’s enough to run readIndex on each mapped partition leader, fetch the commit 
index and wait on the replica until it’s applied. This guarantees that all 
required write intents are replicated and present locally. After that, the 
normal write intent resolution should do the job.

There is a second option, which doesn’t require the network RTT. We can use a 
special low watermark timestamp (safeTs) per replication group, which 
corresponds to the apply index of a replicated entry: whenever the apply index 
advances during replication, the safeTs is monotonically incremented too. The 
HLC timestamp used to advance the safeTs is assigned to replicated entries in 
an ordered way.

Special measures are needed to periodically advance the safeTs if no updates 
are happening. It’s enough to use a special replication command for this 
purpose.

During an RO txn, all we need is to wait until the safeTs advances past the RO 
txn's readTs. 
 !Screenshot from 2022-07-06 16-48-30.png! 


> Implement leader to replica safe time propagation
> -------------------------------------------------
>
>                 Key: IGNITE-17263
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17263
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexander Lapin
>            Priority: Major
>              Labels: ignite-3, transaction3_ro
>         Attachments: Screenshot from 2022-07-06 16-48-30.png, Screenshot from 
> 2022-07-06 16-48-41.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
