[
https://issues.apache.org/jira/browse/RATIS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2376:
-------------------------------
Summary: Support bounded stale reads (was: Support StateMachine#queryStale
based on leader and follower timestamp difference)
> Support bounded stale reads
> ---------------------------
>
> Key: RATIS-2376
> URL: https://issues.apache.org/jira/browse/RATIS-2376
> Project: Ratis
> Issue Type: New Feature
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> This idea came up while discussing how to improve follower reads by
> guaranteeing time-based bounded stale reads.
> In HDDS-13954, Ozone supports a LocalLease feature that uses
> # The elapsed time since the leader's last RPC, to decide whether the local
> lease has expired because the leader has not been reachable for a certain
> time
> # The difference between the leader's commit index and the follower's
> applied index (both are configurable) so that data returned is not too stale.
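> The two LocalLease conditions above can be sketched roughly as follows.
> This is only an illustration: the class, method, and parameter names are
> hypothetical and are not taken from the Ozone or Ratis code.

```java
import java.time.Duration;

/** Hypothetical sketch of the two local-lease conditions; not the actual Ozone code. */
final class LocalLeaseCheck {
  private final Duration leaseTimeout; // assumed config: max time since the leader's last RPC
  private final long maxIndexLag;      // assumed config: max (leader commit - follower applied)

  LocalLeaseCheck(Duration leaseTimeout, long maxIndexLag) {
    this.leaseTimeout = leaseTimeout;
    this.maxIndexLag = maxIndexLag;
  }

  /** Returns true only if the lease is still valid and the follower is not too far behind. */
  boolean canServeRead(Duration sinceLastLeaderRpc, long leaderCommitIndex, long followerAppliedIndex) {
    // Condition 1: the leader has been heard from recently enough.
    boolean leaseValid = sinceLastLeaderRpc.compareTo(leaseTimeout) <= 0;
    // Condition 2: the applied index is within the configured lag of the commit index.
    boolean withinLag = leaderCommitIndex - followerAppliedIndex <= maxIndexLag;
    return leaseValid && withinLag;
  }
}
```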
> The index difference in (2) has a few problems
> # Using an index difference is not user-friendly: users do not know how an
> index difference translates into how stale the data can be. For example,
> suppose the index difference is set to 10,000. In an idle cluster (1,000
> transactions / hour), a follower can return data that is up to 10 hours
> stale, while in a busy cluster (100,000 transactions / hour), a follower can
> return data that is up to 6 minutes (60 minutes / 10) stale. There is no
> mechanism that lets a user specify that reads may be at most 5 minutes
> stale.
> # The leader's commit index is only updated after AppendEntries. Therefore,
> if the leader's AppendEntries RPC is late, the commit index is not updated,
> causing the follower to think that it is too slow and to fail the reads,
> although in reality the follower is caught up.
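> The arithmetic in the first problem can be reproduced with a small helper.
> It assumes a steady write rate, which is a simplification, and the class and
> method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical helper: worst-case staleness implied by an index lag at a steady write rate. */
final class IndexLagStaleness {
  private static final long MILLIS_PER_HOUR = 3_600_000L;

  static Duration worstCaseStaleness(long maxIndexLag, long transactionsPerHour) {
    if (transactionsPerHour <= 0) {
      throw new IllegalArgumentException("transactionsPerHour must be positive");
    }
    // (lag / rate) is the staleness in hours; scale to milliseconds before dividing
    // so that sub-hour results are not truncated to zero.
    long millis = Math.multiplyExact(maxIndexLag, MILLIS_PER_HOUR) / transactionsPerHour;
    return Duration.ofMillis(millis);
  }
}
```

> With an index difference of 10,000, this gives 10 hours at 1,000
> transactions / hour and 6 minutes at 100,000 transactions / hour, so the
> same setting yields very different staleness depending on load.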
> One idea is to use timestamp differences to set bounds on how stale a
> read can be
> - Client can configure the staleness bound (e.g. 100ms means that the user
> can read data that is at most 100ms stale)
> -- 0 means read from leader all the time
> -- This provides a more user-friendly guarantee (timestamp instead of index
> difference)
> - Implementations
> -- The Ratis log would add a timestamp from the leader of the term, which
> the follower uses to check the bound
> -- When a read request is sent to the follower, the follower checks the
> last applied log's leader timestamp and only returns when the difference
> between the current timestamp and that leader timestamp is within the
> staleness bound
> --- If it is not, the read is either blocked (not recommended) or requeued
> (recommended) to the RPC queue
> Afterwards, we can expose this through a
> queryStale(Message request, long timestampDiff) call.
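> The follower-side check described above might look roughly like the sketch
> below. It is not Ratis code: the class, enum, and method names are
> hypothetical, the leader timestamp is assumed to be in milliseconds, and the
> follower's clock is injected so that the logic can be exercised without real
> time passing.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of a follower-side staleness gate; not part of the Ratis API. */
final class BoundedStaleReadGate {
  enum Decision { SERVE_LOCALLY, REQUEUE, READ_FROM_LEADER }

  private final LongSupplier clockMillis;                 // follower's local clock (assumed ms)
  private volatile long lastAppliedLeaderTimestampMillis; // leader timestamp of last applied log

  BoundedStaleReadGate(LongSupplier clockMillis) {
    this.clockMillis = clockMillis;
  }

  /** Called after a log entry is applied, with the leader timestamp it carried. */
  void onLogApplied(long leaderTimestampMillis) {
    lastAppliedLeaderTimestampMillis = leaderTimestampMillis;
  }

  /** Decides how to handle a read with the given client-configured staleness bound. */
  Decision decide(long stalenessBoundMillis) {
    if (stalenessBoundMillis <= 0) {
      return Decision.READ_FROM_LEADER; // a bound of 0 means always read from the leader
    }
    long ageMillis = clockMillis.getAsLong() - lastAppliedLeaderTimestampMillis;
    // Within the bound: serve from the follower. Otherwise requeue rather than block.
    return ageMillis <= stalenessBoundMillis ? Decision.SERVE_LOCALLY : Decision.REQUEUE;
  }
}
```

> The comparison mixes the leader's clock with the follower's clock, so its
> accuracy depends on the clock drift tradeoff discussed in this issue.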
> However, this approach has some tradeoffs
> * We need to ensure that clock drift across nodes is bounded (at least most
> of the time)
> ** AFAIK, unless we have specialized hardware like Google TrueTime, this is
> not currently possible
> * Adding a timestamp to the Raft log might add some space and memory
> overhead
> Alternative implementations
> * CockroachDB:
> https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210519_bounded_staleness_reads.md