[
https://issues.apache.org/jira/browse/RATIS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2376:
-------------------------------
Description:
This is an idea that came up while discussing how to improve follower reads by
guaranteeing bounded stale reads based on time.
In HDDS-13954, Ozone supports a LocalLease feature which uses two checks
(sketched below):
# The elapsed time since the leader's last RPC, to determine whether the local
lease has expired because the leader has not been reachable for a certain time
# The difference between the leader's commit index and the follower's applied
index (both configurable), so that the data returned is not too stale.
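A minimal sketch of the two checks, using illustrative names rather than the
actual HDDS-13954 code:
{code:java}
// Hypothetical sketch of the two LocalLease checks; all names are illustrative.
static boolean canServeLocalRead(long nowMs, long lastLeaderRpcMs, long leaseExpiryMs,
    long leaderCommitIndex, long followerAppliedIndex, long maxIndexLag) {
  // (1) Lease check: the leader must have been heard from recently enough.
  final boolean leaseValid = nowMs - lastLeaderRpcMs < leaseExpiryMs;
  // (2) Staleness check: the follower's applied index must not lag the
  //     leader's commit index by more than the configured bound.
  final boolean notTooStale = leaderCommitIndex - followerAppliedIndex <= maxIndexLag;
  return leaseValid && notTooStale;
}
{code}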
The issue is that the index-difference check in (2) has a few problems:
# An index difference is not user-friendly: users do not know how an index
difference translates to data staleness (see the illustration below). For
example, suppose we set the index difference to 10,000. In an idle cluster
(1,000 transactions/hour), the follower can return data that is 10 hours stale,
while in a busy cluster (100,000 transactions/hour) the follower returns data
that is at most 6 minutes (60 minutes / 10) stale. There is no mechanism that
lets a user specify, for example, that reads must be at most 5 minutes stale.
# The leader's commit index known to the follower is only updated by
AppendEntries. Therefore, if the leader's AppendEntries RPC is late, the commit
index will not be updated, causing the follower to conclude that it is too slow
and to fail the reads, even though it is actually caught up.
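To make the first problem concrete, the same index bound yields very different
time bounds depending on write throughput (illustrative numbers only):
{code:java}
// Illustrative only: how a fixed index-difference bound maps to time staleness.
static double stalenessHours(long maxIndexLag, double txPerHour) {
  return maxIndexLag / txPerHour;
}
// stalenessHours(10_000,   1_000) == 10.0 -> up to 10 hours stale (idle cluster)
// stalenessHours(10_000, 100_000) ==  0.1 -> up to 6 minutes stale (busy cluster)
{code}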
One idea is to use timestamp differences to set bounds on how stale a read can
be:
- The client can configure a max staleness bound (e.g. 100ms means the user may
read data that is at most 100ms old)
-- 0 means the read will wait until the follower's timestamp matches the
leader's timestamp.
-- This provides a more user-friendly guarantee (a timestamp instead of an
index difference)
- Implementation
-- Each Ratis log entry now carries a timestamp from the leader of its term,
which the follower will use to check the bound
-- When a read request arrives at the follower, the follower checks the last
applied log entry's leader timestamp and only returns once the difference
between the current timestamp and that leader timestamp is within the staleness
bound (see the sketch after this list)
--- If it is not, the read is either blocked (not recommended) or requeued
(recommended) on the RPC queue
-- In an idle cluster with few writes, the last applied timestamp will not
advance, so bounded stale reads might block indefinitely
--- A few approaches to handle this:
---- Set up a side channel to propagate the leader's timestamp
---- Fall back to a linearizable read
----- The argument is that ReadIndex performance should be good when there are
no contending writes, because the follower does not need to wait: its applied
index should already have reached the read index
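A minimal sketch of the follower-side decision, assuming the leader timestamp
is tracked with the last applied entry (all names are hypothetical):
{code:java}
// Hypothetical follower-side check for bounded stale reads.
enum ReadDecision { SERVE, REQUEUE, FALLBACK_TO_LINEARIZABLE }

static ReadDecision checkBoundedStaleRead(long nowMs, long lastAppliedLeaderTimestampMs,
    long maxStalenessMs, long requestDeadlineMs) {
  final long staleness = nowMs - lastAppliedLeaderTimestampMs;
  if (staleness <= maxStalenessMs) {
    // The last applied entry is recent enough: serve the read locally.
    return ReadDecision.SERVE;
  }
  if (nowMs < requestDeadlineMs) {
    // Not yet within the bound: requeue instead of blocking the RPC thread.
    return ReadDecision.REQUEUE;
  }
  // The applied timestamp may never advance in an idle cluster, so rather
  // than wait forever, fall back to a linearizable (ReadIndex) read.
  return ReadDecision.FALLBACK_TO_LINEARIZABLE;
}
{code}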
Afterwards, we can add a queryStale(Message request, long maxStalenessMs, long
lastSeenIndex, long lastSeenTimestamp) API to implement this.
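Hypothetical client usage of the proposed API (queryStale does not exist in
Ratis yet, and the -1 sentinel for "no floor" is an assumption):
{code:java}
// Ask a follower for data at most 100ms stale, with no monotonic-read floor.
Message request = Message.valueOf("read:key-a");
RaftClientReply reply = client.io().queryStale(
    request,
    100L,  // maxStalenessMs
    -1L,   // lastSeenIndex (assumed sentinel: no floor)
    -1L);  // lastSeenTimestamp (assumed sentinel: no floor)
{code}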
Consistency guarantees
* Configurable bounded stale reads
* Consistent prefix / snapshot isolation (if specified): the ordering of
transactions is still guaranteed (e.g. if key a is created and then key b is
created on the leader, the follower sees the same ordering) due to Raft's
sequential nature
* Monotonic reads (if a last seen index or timestamp is specified): a single
client sees the data as monotonically increasing (see the sketch after this
list)
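A sketch of how a client could maintain the monotonic-read floor across calls;
queryStale and the reply accessors used below are the proposed API, not
existing Ratis methods:
{code:java}
import java.io.IOException;
import org.apache.ratis.client.RaftClient;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;

// Hypothetical client-side session that remembers the highest index and
// timestamp observed, so a follower never serves this client older state.
class StaleReadSession {
  private long lastSeenIndex = -1;
  private long lastSeenTimestamp = -1;

  Message read(RaftClient client, Message request, long maxStalenessMs)
      throws IOException {
    final RaftClientReply reply = client.io().queryStale(
        request, maxStalenessMs, lastSeenIndex, lastSeenTimestamp);
    // Assumed: the reply exposes the applied index and the leader timestamp
    // at which the read was served.
    lastSeenIndex = Math.max(lastSeenIndex, reply.getLogIndex());
    lastSeenTimestamp = Math.max(lastSeenTimestamp, reply.getLeaderTimestamp());
    return reply.getMessage();
  }
}
{code}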
Advantages
* This removes the one RTT needed by the ReadIndex algorithm, which improves
latency
* In a multi-region setup, we can route read requests to the nearest follower
instead of the leader (and without the follower contacting the leader on each
call)
However, this approach has some tradeoffs
* We need to ensure that the clock drift between nodes is bounded (at least
most of the time)
** AFAIK, unless we have specialized hardware like Google TrueTime, this cannot
currently be guaranteed
* Adding a timestamp to every Raft log entry may add some space and memory
overhead
Alternative implementations
* CockroachDB:
[https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210519_bounded_staleness_reads.md]
> Support bounded stale reads
> ---------------------------
>
> Key: RATIS-2376
> URL: https://issues.apache.org/jira/browse/RATIS-2376
> Project: Ratis
> Issue Type: New Feature
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)