[ 
https://issues.apache.org/jira/browse/IGNITE-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-17872:
----------------------------------
    Description: 
Safe time for non-primary replicas (see IGNITE-17263 ) was conceived as 
optimization to avoid unnecessary network hops. Safe time is propagated from 
primary replica via raft appendEntries messages. When there is constant load on 
cluster that is caused by RW transactions, these messages are refreshing safe 
time on replicas with decent frequency, but in case of idle cluster, or cluster 
with read-only load, safe time is propagated periodically via heartbeats. This 
means that, if a RO transaction with read timestamp in present or future, is 
trying to read a value from non-primary replica, it will wait for safe time 
first, which is bound to frequency of heartbeat messages, and hence, the 
duration of the read operation may be close to the period of heartbeats. This 
looks weird and will cause performance issues. 

Example:
Heartbeat period is 500 ms. 
Current safe time on replica is 1.
We are processing read-only request with timestamp=2. 
There were no RW transactions for some time, and the next expected update of 
safe time, according to the heartbeat period, is 1 + 500 = 501.
This means that we should wait for about 499 ms (assuming the clock skew and 
ping in cluster is 0) to proceed with RO request processing.

So, even though safe time is an optimization, we shouldn't use it in cases when 
there are no RW transactions affecting the given replica, and the timestamp of 
current RO transaction is greater than safe time. Instead of waiting for the 
safe time update, we should fallback to reading index from the leader to 
minimize the time of processing the current RO request.

To do this, we should compare the read timestamp with safe time, and if read 
timestamp is greater, and since the last RW transaction (affecting this 
replica) some time passed that is greater than some timeout (i.e. we expect 
that the safe time will be updated only via periodic updates) we shouldn't wait 
for safe time and perform read index request to leader to get the latest 
updates that may not have been replicated yet. 

If readIndex shows that the current committed index on leader is the same that 
is on replica (i.e. replica doesn't expect any updates that are being 
replicated) it means that read-only request must wait for safe time update 
without further attempts to repeat readIndex operation which highly likely will 
be mostly useless.

We should also think about the measures to prevent extra load from replicas 
spamming readIndex while receiving multiple read-only requests.

  was:
Safe time for non-primary replicas (see IGNITE-17263 ) was conceived as 
optimization to avoid unnecessary network hops. Safe time is propagated from 
primary replica via raft appendEntries messages. When there is constant load on 
cluster that is caused by RW transactions, these messages are refreshing safe 
time on replicas with decent frequency, but in case of idle cluster, or cluster 
with read-only load, safe time is propagated periodically via heartbeats. This 
means that, if a RO transaction with read timestamp in present or future, is 
trying to read a value from non-primary replica, it will wait for safe time 
first, which is bound to frequency of heartbeat messages, and hence, the 
duration of the read operation may be close to the period of heartbeats. This 
looks weird and will cause performance issues. 

Example:
Heartbeat period is 500 ms. 
Current safe time on replica is 1.
We are processing read-only request with timestamp=2. 
There were no RW transactions for some time, and the next expected update of 
safe time, according to the heartbeat period, is 1 + 500 = 501.
This means that we should wait for about 499 ms (assuming the clock skew and 
ping in cluster is 0) to proceed with RO request processing.

So, even though safe time is an optimization, we shouldn't use it in cases when 
there are no RW transactions affecting the given replica, and the timestamp of 
current RO transaction is greater than safe time. Instead of waiting for the 
safe time update, we should fallback to reading index from the leader to 
minimize the time of processing the current RO request.

To do this, we should compare the read timestamp with safe time, and if read 
timestamp is greater, and since the last RW transaction (affecting this 
replica) some time passed that is greater than some timeout (i.e. we expect 
that the safe time will be updated only via periodic updates) we shouldn't wait 
for safe time and perform read index request to leader to get the latest 
updates that may not have been replicated yet. 

If readIndex shows that the current committed index on leader is the same that 
is on replica (i.e. replica doesn't expect any updates that are being 
replicated) it means that read-only request must wait for actual safe time 
without further attempts to repeat readIndex operation which highly likely will 
be mostly useless.

We should also think about the measures to prevent extra load from replicas 
spamming readIndex while receiving multiple read-only requests.


> Fetch commit index on non-primary replicas instead of waiting for safe time 
> in case of RO tx on idle cluster
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-17872
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17872
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3, transaction3_ro
>
> Safe time for non-primary replicas (see IGNITE-17263 ) was conceived as 
> optimization to avoid unnecessary network hops. Safe time is propagated from 
> primary replica via raft appendEntries messages. When there is constant load 
> on cluster that is caused by RW transactions, these messages are refreshing 
> safe time on replicas with decent frequency, but in case of idle cluster, or 
> cluster with read-only load, safe time is propagated periodically via 
> heartbeats. This means that, if a RO transaction with read timestamp in 
> present or future, is trying to read a value from non-primary replica, it 
> will wait for safe time first, which is bound to frequency of heartbeat 
> messages, and hence, the duration of the read operation may be close to the 
> period of heartbeats. This looks weird and will cause performance issues. 
> Example:
> Heartbeat period is 500 ms. 
> Current safe time on replica is 1.
> We are processing read-only request with timestamp=2. 
> There were no RW transactions for some time, and the next expected update of 
> safe time, according to the heartbeat period, is 1 + 500 = 501.
> This means that we should wait for about 499 ms (assuming the clock skew and 
> ping in cluster is 0) to proceed with RO request processing.
> So, even though safe time is an optimization, we shouldn't use it in cases 
> when there are no RW transactions affecting the given replica, and the 
> timestamp of current RO transaction is greater than safe time. Instead of 
> waiting for the safe time update, we should fallback to reading index from 
> the leader to minimize the time of processing the current RO request.
> To do this, we should compare the read timestamp with safe time, and if read 
> timestamp is greater, and since the last RW transaction (affecting this 
> replica) some time passed that is greater than some timeout (i.e. we expect 
> that the safe time will be updated only via periodic updates) we shouldn't 
> wait for safe time and perform read index request to leader to get the latest 
> updates that may not have been replicated yet. 
> If readIndex shows that the current committed index on leader is the same 
> that is on replica (i.e. replica doesn't expect any updates that are being 
> replicated) it means that read-only request must wait for safe time update 
> without further attempts to repeat readIndex operation which highly likely 
> will be mostly useless.
> We should also think about the measures to prevent extra load from replicas 
> spamming readIndex while receiving multiple read-only requests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to