[
https://issues.apache.org/jira/browse/RATIS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2376:
-------------------------------
Summary: Support timestamp bounded stale reads (was: Support bounded stale
reads)
> Support timestamp bounded stale reads
> -------------------------------------
>
> Key: RATIS-2376
> URL: https://issues.apache.org/jira/browse/RATIS-2376
> Project: Ratis
> Issue Type: New Feature
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> This is an idea that came up while discussing how to improve follower reads
> by guaranteeing time-bounded stale reads.
> In HDDS-13954, Ozone supports a LocalLease feature that uses:
> # The elapsed time since the leader's last RPC, to check whether the local
> lease has expired because the leader has not been reachable for a certain
> time.
> # The difference between the leader's commit index and the follower's
> applied index (both configurable), so that the data returned is not too stale.
> The issue with (2) is that the index difference has a few problems:
> # An index difference is not user-friendly: users do not know how an index
> difference translates into how stale the data can be. For example, suppose
> we set the index difference to 10,000. In an idle cluster (1,000
> transactions / hour), a follower can return data that is up to 10 hours
> stale, while in a busy cluster (100,000 transactions / hour), a follower can
> return data that is up to 6 minutes (0.1 hour) stale. There is no way for a
> user to specify, for example, that reads must be at most 5 minutes stale.
> # The leader's commit index known to the follower is only updated by
> AppendEntries. Therefore, if the leader's AppendEntries RPC is late, the
> commit index will not be updated, causing the follower to appear too slow
> and the reads to fail, although in reality the follower is caught up.
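> The arithmetic above can be sketched as follows; `stalenessMinutes` is a
> hypothetical helper for illustration, not a Ratis API:

```java
// Converts a configured index difference into the worst-case time staleness
// it implies at a given write throughput.
public class IndexStaleness {
    /**
     * A follower that is indexDiff entries behind, in a cluster committing
     * txPerHour transactions per hour, may serve data that is up to
     * (indexDiff / txPerHour) hours, i.e. the value below in minutes, stale.
     */
    public static double stalenessMinutes(long indexDiff, long txPerHour) {
        return (double) indexDiff / txPerHour * 60.0;
    }

    public static void main(String[] args) {
        // Idle cluster: 1,000 tx/hour -> 10,000 indexes = 600 minutes (10 hours).
        System.out.println(stalenessMinutes(10_000, 1_000));
        // Busy cluster: 100,000 tx/hour -> same config = 6 minutes.
        System.out.println(stalenessMinutes(10_000, 100_000));
    }
}
```

> The same index-difference configuration therefore gives wildly different
> time guarantees depending on cluster load.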
> One idea is to use timestamp differences to set bounds on how stale a read
> can be:
> - The client can configure a max staleness bound (e.g. 100ms means the
> client can read data that is at most 100ms old).
> -- 0 means the read will wait until the follower's timestamp catches up to
> the leader's timestamp.
> -- This provides a more user-friendly guarantee (a timestamp instead of an
> index difference).
> - Implementation
> -- Each Ratis log entry now carries the leader's timestamp, which the
> follower will use to check the bound.
> -- When a read request is sent to a follower, the follower checks the leader
> timestamp of its last applied log entry, and only returns when the
> difference between the current timestamp and that leader timestamp is within
> the staleness bound.
> --- If it is not, the read is either blocked (not recommended) or requeued
> (recommended) to the RPC queue.
> -- In an idle cluster with few writes, the last applied timestamp will not
> advance, so bounded stale reads might be blocked indefinitely.
> --- A few approaches to handle this:
> ---- Set up a side channel to propagate the leader's timestamp.
> ---- Fall back to a linearizable read.
> ----- The argument is that ReadIndex performance should be good when there
> are no contending writes, because the follower does not need to wait: its
> applied index should already have reached the read index.
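> The follower-side check described above can be sketched as follows; the
> class and method names are hypothetical, not actual Ratis APIs:

```java
// Decides whether a bounded-stale read can be served by this follower now,
// based on the leader timestamp carried by the last applied log entry.
public class StaleReadCheck {
    /**
     * @param lastAppliedLeaderTsMs leader timestamp of the last applied entry
     * @param nowMs                 current time on the follower (assumes clock
     *                              drift between leader and follower is bounded)
     * @param maxStalenessMs        client-configured staleness bound; 0 means
     *                              the read waits until the follower catches up
     *                              to the leader's current timestamp
     * @return true if the read can be served now; false if it should be
     *         requeued to the RPC queue (or fall back to a linearizable read)
     */
    public static boolean withinBound(long lastAppliedLeaderTsMs,
                                      long nowMs, long maxStalenessMs) {
        return nowMs - lastAppliedLeaderTsMs <= maxStalenessMs;
    }
}
```

> Requeueing rather than blocking keeps RPC handler threads free while the
> follower waits for its applied timestamp to advance.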
> Afterwards, we can send a queryStale(Message request, long maxStalenessMs,
> long lastSeenIndex, long lastSeenTimestamp) request to implement this.
> Consistency guarantees
> * Configurable bounded stale reads.
> * Consistent prefix / snapshot isolation (if specified): the ordering of
> transactions is still guaranteed (e.g. if key a is created and then key b is
> created on the leader, the follower will see the same ordering) due to
> Raft's sequential nature.
> * Monotonic reads (if a last seen index or timestamp is specified): a single
> client sees data as monotonically increasing.
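> Client-side bookkeeping for the monotonic-reads guarantee on top of the
> proposed queryStale API could be sketched like this; the class and method
> names are hypothetical:

```java
// Tracks the highest index and leader timestamp a client has observed, so
// that the values passed as lastSeenIndex / lastSeenTimestamp to queryStale
// never move backwards, even if consecutive reads land on different followers.
public class MonotonicReadClient {
    private long lastSeenIndex = -1;
    private long lastSeenTimestampMs = -1;

    /** Record the applied index and leader timestamp from a successful read. */
    public void onReadReply(long appliedIndex, long leaderTimestampMs) {
        // Never regress: this is what makes the reads monotonic.
        lastSeenIndex = Math.max(lastSeenIndex, appliedIndex);
        lastSeenTimestampMs = Math.max(lastSeenTimestampMs, leaderTimestampMs);
    }

    public long getLastSeenIndex() { return lastSeenIndex; }
    public long getLastSeenTimestampMs() { return lastSeenTimestampMs; }
}
```

> A follower serving a queryStale request would then wait (or requeue) until
> its applied index and timestamp reach the client's last seen values.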
> Advantages
> * This removes the one RTT needed by the ReadIndex algorithm, which improves
> latency.
> * In a multi-region setup, we can route read requests to the nearest
> follower instead of the leader (and the follower does not need to contact
> the leader for each call).
> However, this approach has some tradeoffs:
> * We need to ensure that clock drift is bounded (at least most of the time).
> ** AFAIK, unless we have specialized hardware like Google TrueTime, this is
> not currently possible.
> * The addition of a timestamp to each Raft log entry might add some space
> and memory overhead.
> Alternative implementations
> * CockroachDB:
> [https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210519_bounded_staleness_reads.md]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)