[ 
https://issues.apache.org/jira/browse/RATIS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated RATIS-2376:
-------------------------------
    Summary: Support timestamp bounded stale reads  (was: Support bounded stale 
reads)

> Support timestamp bounded stale reads
> -------------------------------------
>
>                 Key: RATIS-2376
>                 URL: https://issues.apache.org/jira/browse/RATIS-2376
>             Project: Ratis
>          Issue Type: New Feature
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> This is an idea that came up while discussing about improving follower reads 
> by guaranteeing bounded stale reads based on time.
> In HDDS-13954, Ozone supports LocalLease feature where it uses
>  # The leader's last RPC elapsed time to see if the local lease has expired 
> since it has not the leader has not been able to be contacted for a certain 
> time
>  # The difference between the leader's commit index and the follower's 
> applied index (both are configurable) so that data returned is not too stale.
> The issue with (2) is that index difference have a few problems
>  # Using index difference not user-friendly: Users do not know how index 
> difference translates to how stale a data can be. For example, we set the 
> index difference to be 10,000. In an idle cluster (1000 transactions / hour), 
> this can mean follower can return data that is 10 hours stale data, while in 
> busy cluster (100,000 transactions / hour), this means that follower can 
> return data that is 6 minutes (60 minutes / 10) late. There is no mechanism 
> to allow user to set such that it can read data that is at most 5 minutes 
> stale.
>  # Leader's commit index is only updated after AppendEntries. Therefore, if 
> the leader AppendEntries RPC is late, the commit index not be updated, 
> causing follower to think that the follower is too slow and failing the reads 
> although in reality the follower is caught up.
> One idea is to use timestamp differences to setup bounds on the how stale a 
> read can be
>  - Client can configure the max staleness bound (e.g. 100ms means that user 
> can read data that at most 100ms old)
>  -- 0 means will wait until the follower timestamp is the same as the 
> leader's timestamp.
>  -- This provides a more user-friendly guarantee (timestamp instead of index 
> difference)
>  - Implementations
>  -- Ratis log now adds a timestamp of the leader for the term that will be 
> used by the follower to check on the bounds
>  -- When a read request is sent to the follower, the follower will check the 
> last applied log's leader timestamp and only returns when current timestamp 
> and the last applied log's leader timestamp is within the staleness bound
>  --- If it's not, the read is either blocked (not recommended) or requeued 
> (recommended) to the RPC queue
>  -- For idle cluster with low writes, the last applied timestamp will not be 
> advanced so the stale bounded reads might be blocked indefinitely
>  --- Few approaches to handle this
>  ---- Setup a side channel to propagate the leader's timestamp
>  ---- Fallback to linearizable read
>  ----- The argument is that since ReadIndex performance should be good if 
> there are not contending writes because follower does not need to wait as the 
> follower applied index should already reach the read index
> Afterwards, we can send a queryStale(Message request, long maxStalenessMs, 
> long lastSeenIndex, long lastSeenTimestamp) to implement this. 
> Consistency guarantee
>  * Configurable bounded stale reads
>  * Consistent prefix / snapshot isolation (if specified): The ordering of the 
> transactions are still guaranteed (e.g. meaning if key a is created and then 
> key b is created in the leader, the follower should see the same ordering) 
> due to the raft sequential nature
>  * Monotonic reads (if last seen index or timestamp is specified): A single 
> client can see data as monotonically increasing
> Advantages
>  * This removes the need to do one RTT on the ReadIndex algorithm, which 
> improves latency
>  * In multi-region setup, we can route the read request to the nearest 
> follower instead of the leader (or follower needing to contact the leader for 
> each call)
> However, this approach has some tradeoffs
>  * We need to ensure that CPU clock bound drift is bounded (at least most of 
> the times)
>  ** AFAIK, unless we have a specialized hardware like Google TrueTime, it's 
> not currently possible
>  * The addition of timestamp in Raft log might adds some space and memory 
> overhead
> Alternative implementations
>  * CockroachDB: 
> [https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210519_bounded_staleness_reads.md]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to