[ 
https://issues.apache.org/jira/browse/RATIS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated RATIS-2376:
-------------------------------
    Summary: Support bounded stale reads  (was: Support StateMachine#queryStale 
based on leader and follower timestamp difference)

> Support bounded stale reads
> ---------------------------
>
>                 Key: RATIS-2376
>                 URL: https://issues.apache.org/jira/browse/RATIS-2376
>             Project: Ratis
>          Issue Type: New Feature
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> This is an idea that came up while discussing about improving follower reads 
> by guaranteeing bounded stale reads based on time.
> In HDDS-13954, Ozone supports LocalLease feature where it uses
>  # The leader's last RPC elapsed time to see if the local lease has expired 
> since it has not the leader has not been able to be contacted for a certain 
> time
>  # The difference between the leader's commit index and the follower's 
> applied index (both are configurable) so that data returned is not too stale.
> The issue with (2) is that index difference have a few problems
>  # Using index difference not user-friendly: Users do not know how index 
> difference translates to how stale a data can be. For example, we set the 
> index difference to be 10,000. In an idle cluster (1000 transactions / hour), 
> this can mean follower can return data that is 10 hours stale data, while in 
> busy cluster (100,000 transactions / hour), this means that follower can 
> return data that is 6 minutes (60 minutes / 10) late. There is no mechanism 
> to allow user to set such that it can read data that is at most 5 minutes 
> stale.
>  # Leader's commit index is only updated after AppendEntries. Therefore, if 
> the leader AppendEntries RPC is late, the commit index not be updated, 
> causing follower to think that the follower is too slow and failing the reads 
> although in reality the follower is caught up.
> One idea is to use timestamp differences to setup bounds on the how stale a 
> read can be
>  - Client can configure the staleness bound (e.g. 100ms means that user can 
> read data that at most stale for 100ms)
>  -- 0 means read from leader all the time
>  -- This provides a more user-friendly guarantee (timestamp instead of index 
> difference)
>  - Implementations
>  -- Ratis log now adds a timestamp of the leader for the term that will be 
> used by the follower to check on the bounds
>  -- When a read request is sent to the follower, the follower will check the 
> last applied log's leader timestamp and only returns when current timestamp 
> and the last applied log's leader timestamp is within the staleness bound
>  --- If it's not, the read is either blocked (not recommended) or requeued 
> (recommended) to the RPC queue
> Afterwards, we can send a queryStale(Message request, long timestampDiff) to 
> implement this. 
> However, this approach has some tradeoffs
>  * We need to ensure that CPU clock bound drift is bounded (at least most of 
> the times)
>  ** AFAIK, unless we have a specialized hardware like Google TrueTime, it's 
> not currently possible
>  * The addition of timestamp in Raft log might adds some space and memory 
> overhead
> Alternative implementations
>  * CockroachDB: 
> https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210519_bounded_staleness_reads.md



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to