[ 
https://issues.apache.org/jira/browse/RATIS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated RATIS-2376:
-------------------------------
    Description: 
This is an idea that came up while discussing how to improve follower reads.

In HDDS-13954, Ozone supports a LocalLease feature that uses
 # The elapsed time since the leader's last RPC, to determine whether the local lease has expired because the leader has not been contactable for a certain time.
 # The difference between the leader's commit index and the follower's applied index (both thresholds are configurable), so that the data returned is not too stale. A rough sketch of both checks follows this list.
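
For illustration only, the two checks above might look roughly like this (all names here are hypothetical, not the actual HDDS-13954 configuration keys or APIs):

{code:java}
/** Illustrative sketch only; names do not correspond to real Ozone/Ratis APIs. */
final class LocalLeaseCheckSketch {
  static boolean canServeLocalRead(long leaderLastRpcElapsedMs,
                                   long leaderCommitIndex,
                                   long followerAppliedIndex,
                                   long maxLeaderRpcElapsedMs,  // config for check (1)
                                   long maxIndexLag) {          // config for check (2)
    // (1) The lease is valid only if the leader was heard from recently enough.
    final boolean leaseValid = leaderLastRpcElapsedMs <= maxLeaderRpcElapsedMs;
    // (2) The applied data must not lag the leader's commit index too far.
    final boolean notTooStale = leaderCommitIndex - followerAppliedIndex <= maxIndexLag;
    return leaseValid && notTooStale;
  }
}
{code}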

The issue with (2) is that an index difference has a few problems:
 # Using an index difference is not user-friendly: users do not know how an index difference translates into how stale the data can be. For example, suppose we set the index difference to 10,000. In an idle cluster (1,000 transactions/hour) the follower can return data that is up to 10 hours stale, while in a busy cluster (100,000 transactions/hour) the follower can return data that is at most 6 minutes (60 minutes / 10) stale. There is no mechanism that lets the user specify, for example, that reads must return data that is at most 5 minutes stale.
 # The follower only learns the leader's commit index from AppendEntries. Therefore, if the leader's AppendEntries RPC is late, the commit index is not updated, causing the follower to think it is too slow and to fail the reads, even though in reality it is caught up.

One idea is to use timestamp differences, instead of index differences, to set a bound on how stale a read can be:
 - The client can configure the staleness bound (e.g. 100ms means the user may read data that is at most 100ms stale).
 -- 0 means always read from the leader.
 -- This provides a more user-friendly guarantee (a timestamp instead of an index difference).
 - Implementation:
 -- The Ratis log now includes the term leader's timestamp, which the follower will use to check the bound.
 -- When a read request is sent to the follower, the follower checks the leader timestamp of the last applied log entry and only returns data when the difference between the current timestamp and that leader timestamp is within the staleness bound (see the sketch after this list).
 --- If it is not, the read is either blocked (not recommended) or requeued (recommended) to the RPC queue.
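
A minimal sketch of that follower-side check, assuming each applied log entry exposes the leader's wall-clock timestamp (all names below are hypothetical):

{code:java}
/** Illustrative sketch only; assumes the last applied entry carries a leader timestamp. */
final class StalenessBoundSketch {
  /**
   * @param lastAppliedLeaderTimestampMs leader wall-clock time of the last applied entry
   * @param stalenessBoundMs             client-configured bound (0 = leader-only reads)
   * @return true if the follower may serve the read locally
   */
  static boolean withinBound(long lastAppliedLeaderTimestampMs, long stalenessBoundMs) {
    if (stalenessBoundMs == 0) {
      return false; // 0 means always read from the leader
    }
    final long stalenessMs = System.currentTimeMillis() - lastAppliedLeaderTimestampMs;
    return stalenessMs <= stalenessBoundMs;
  }
}
{code}

Note that this comparison mixes the leader's clock (when the entry was created) with the follower's clock (now), which is why the clock drift concern in the tradeoffs below matters.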

Afterwards, we can add a queryStale(Message request, long timestampDiff) call to implement this.
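
The shape of such a call might look roughly like the following sketch; the actual signature, and how it fits into the existing StateMachine interface, are to be decided:

{code:java}
import java.util.concurrent.CompletableFuture;
import org.apache.ratis.protocol.Message;

/** Sketch only; the real StateMachine interface has many more members. */
interface TimestampStaleQuerySketch {
  /**
   * Query a follower's state machine, allowing results that are at most
   * {@code timestampDiff} milliseconds stale relative to the leader.
   * A value of 0 would force the read to be served by the leader.
   */
  CompletableFuture<Message> queryStale(Message request, long timestampDiff);
}
{code}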

However, this approach has some tradeoffs:
 * We need to ensure that clock drift between servers is bounded (at least most of the time).
 ** AFAIK, unless we have specialized hardware like Google TrueTime, this cannot currently be guaranteed.
 * Adding a timestamp to the Raft log might add some space and memory overhead.


> Support StateMachine#queryStale based on leader and follower timestamp 
> difference
> ---------------------------------------------------------------------------------
>
>                 Key: RATIS-2376
>                 URL: https://issues.apache.org/jira/browse/RATIS-2376
>             Project: Ratis
>          Issue Type: New Feature
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
