[jira] [Created] (RATIS-2376) Support StateMachine#queryStale based on leader and follower timestamp difference

Ivan Andika (Jira) Mon, 22 Dec 2025 18:44:07 -0800

Ivan Andika created RATIS-2376:
----------------------------------

             Summary: Support StateMachine#queryStale based on leader and 
follower timestamp difference
                 Key: RATIS-2376
                 URL: https://issues.apache.org/jira/browse/RATIS-2376
             Project: Ratis
          Issue Type: New Feature
            Reporter: Ivan Andika
            Assignee: Ivan Andika



This is an idea that came up while discussing about improving follower reads.

In HDDS-13954, Ozone supports LocalLease feature where it uses 
1. The leader's last RPC elapsed time to see if the local lease has expired 
since it has not the leader has not been able to be contacted for a certain time
2. The difference between the leader's commit index and the follower's applied 
index (both are configurable) so that data returned is not too stale.

The issue with (2) is that index difference have a few problems
1. Using index difference not user-friendly: Users do not know how index 
difference translates to how stale a data can be. For example, we set the index 
difference to be 10,000. In a idle cluster (1000 transactions / hour), this can 
mean follower can return data that is 10 hours stale data while in busy cluster 
(100,000 transactions / hour), this means that follower can return data that is 
6 minutes (60 minutes / 10) late. There is no mechanism to allow user to set 
such that it can read data that is at most 5 minutes stale.
2. Leader's commit index is only updated after AppendEntries. Therefore, if the 
leader AppendEntries RPC is late, the commit index not be updated, causing 
follower to think that the follower is too slow and failing the reads although 
in reality the follower is caught up.

One idea is to use timestamp, instead of index differences to setup bounds on 
the how stale a read can be
 - Client can configure the staleness bound (e.g. 100ms means that user can 
read data that at most stale for 100ms)
 - 0 means read from leader all the time
 - This provides a more user-friendly guarantee
 - Ratis log now adds a timestamp of the leader for the term that will be used 
by the follower to check on the bound
 - When a read request is sent to the follower, the follower will check the 
last applied log's leader timestamp and only returns when current timestamp and 
the last applied log's leader timestamp is within the staleness bound
If it's not, the read is either blocked (not recommended) or requeued 
(recommended) to the RPC queue

Afterwards, we can send a queryStale(Message request, long timestampDiff) to 
implement this. 

However, this approach has a tradeoffs
 * We need to ensure that CPU clock bound drift is bounded (at least most of 
the times)
 ** AFAIK, unless we have a specialized hardware like Google TrueTime, it's not 
currently possible
 * The addition of timestamp in Raft log might adds some space and memory 
overhead



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (RATIS-2376) Support StateMachine#queryStale based on leader and follower timestamp difference

Reply via email to