Ivan Andika created RATIS-2376:
----------------------------------
Summary: Support StateMachine#queryStale based on leader and
follower timestamp difference
Key: RATIS-2376
URL: https://issues.apache.org/jira/browse/RATIS-2376
Project: Ratis
Issue Type: New Feature
Reporter: Ivan Andika
Assignee: Ivan Andika
This is an idea that came up while discussing how to improve follower reads.
In HDDS-13954, Ozone supports a LocalLease feature that uses
1. The elapsed time since the leader's last RPC, to decide whether the local
lease has expired because the leader has not been reachable for a certain time
2. The difference between the leader's commit index and the follower's applied
index (both are configurable), so that the data returned is not too stale.
The issue with (2) is that using the index difference has a few problems
1. Index difference is not user-friendly: users do not know how an index
difference translates to how stale the data can be. For example, suppose we set
the index difference to 10,000. In an idle cluster (1,000 transactions / hour),
a follower can return data that is up to 10 hours stale, while in a busy
cluster (100,000 transactions / hour), a follower can return data that is up to
6 minutes (60 minutes / 10) stale. There is no mechanism for a user to specify
that reads may be at most, say, 5 minutes stale.
2. The follower's view of the leader's commit index is only updated by
AppendEntries. Therefore, if the leader's AppendEntries RPC is late, the commit
index is not updated, causing the follower to think that it is too slow and to
fail reads although in reality the follower is caught up.
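The arithmetic in (1) can be sketched as follows. This is only an illustration: IndexGapStaleness and worstCaseStaleness are hypothetical names, not part of Ratis or Ozone, and the sketch assumes a steady write rate, which real clusters do not have.

```java
import java.time.Duration;

/** Hypothetical helper (not a Ratis API): turns an index gap into a worst-case staleness. */
final class IndexGapStaleness {

  /**
   * @param indexGap            allowed gap between leader commit index and follower applied index
   * @param transactionsPerHour assumed steady write rate of the cluster
   * @return how old the follower's data can be while still passing the index check
   */
  static Duration worstCaseStaleness(long indexGap, long transactionsPerHour) {
    if (indexGap < 0 || transactionsPerHour <= 0) {
      throw new IllegalArgumentException("indexGap must be >= 0 and rate must be > 0");
    }
    // indexGap / rate is in hours; convert to milliseconds first to avoid integer truncation.
    long millis = Math.multiplyExact(indexGap, 3_600_000L) / transactionsPerHour;
    return Duration.ofMillis(millis);
  }

  public static void main(String[] args) {
    // Same index gap, very different staleness:
    System.out.println(worstCaseStaleness(10_000, 1_000));   // idle cluster: 10 hours
    System.out.println(worstCaseStaleness(10_000, 100_000)); // busy cluster: 6 minutes
  }
}
```

The same configured value (10,000) yields bounds two orders of magnitude apart, which is why an index gap is hard to reason about as a staleness setting.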
One idea is to use timestamps instead of index differences to set bounds on
how stale a read can be
- The client can configure the staleness bound (e.g. 100ms means that the user
can read data that is at most 100ms stale)
- 0 means always read from the leader
- This provides a more user-friendly guarantee
- Ratis log entries now carry a timestamp from the leader of that term, which
the follower uses to check the bound
- When a read request is sent to the follower, the follower checks the last
applied log entry's leader timestamp and only returns when the difference
between the current timestamp and that leader timestamp is within the staleness
bound. If it is not, the read is either blocked (not recommended) or requeued
(recommended) to the RPC queue
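A minimal sketch of the follower-side check described in the list above. All names here (StalenessBoundCheck, Decision, decide) are hypothetical and do not exist in Ratis; timestamps are assumed to be wall-clock milliseconds, and the sketch ignores clock skew between leader and follower.

```java
/** Hypothetical sketch of the follower-side staleness check (not a Ratis API). */
final class StalenessBoundCheck {

  enum Decision { READ_FROM_LEADER, SERVE_LOCALLY, REQUEUE }

  /**
   * @param nowMillis                        follower's current wall-clock time
   * @param lastAppliedLeaderTimestampMillis leader timestamp stored in the last applied log entry
   * @param stalenessBoundMillis             client-configured bound; 0 means always read from the leader
   */
  static Decision decide(long nowMillis,
                         long lastAppliedLeaderTimestampMillis,
                         long stalenessBoundMillis) {
    if (stalenessBoundMillis < 0) {
      throw new IllegalArgumentException("staleness bound must be >= 0");
    }
    if (stalenessBoundMillis == 0) {
      return Decision.READ_FROM_LEADER;
    }
    // Age of the follower's state as seen through the leader's timestamp.
    // A negative age (follower clock behind the leader's) also passes the check here.
    long ageMillis = nowMillis - lastAppliedLeaderTimestampMillis;
    return ageMillis <= stalenessBoundMillis
        ? Decision.SERVE_LOCALLY
        : Decision.REQUEUE; // requeue rather than block, as recommended above
  }
}
```

For example, with a 100ms bound, decide(1_000, 950, 100) serves the read locally (50ms old), while decide(1_000, 800, 100) requeues it (200ms old).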
Afterwards, we can add a queryStale(Message request, long timestampDiff)
variant to implement this.
However, this approach has some tradeoffs
* We need to ensure that clock drift between nodes is bounded (at least most of
the time)
** AFAIK, unless we have specialized hardware like Google TrueTime, this is not
currently possible
* Adding a timestamp to the Raft log might add some space and memory
overhead
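To illustrate why the clock bound matters, here is a rough sketch of how a follower could compensate if an upper bound on clock skew were known. This is an assumption the issue does not make: SkewAwareBound and effectiveBoundMillis are hypothetical names, and maxClockSkewMillis is a value an operator would have to supply (e.g. from NTP monitoring).

```java
/** Hypothetical sketch (not a Ratis API): shrink the staleness bound by an assumed max clock skew. */
final class SkewAwareBound {

  /**
   * If the follower's clock can lag the leader's by up to maxClockSkewMillis, the measured
   * age of an entry can understate its true age by that much. Subtracting the skew from the
   * configured bound keeps the true staleness within the bound the client asked for.
   *
   * @return the bound to compare the measured age against; 0 means no follower read is safe
   */
  static long effectiveBoundMillis(long stalenessBoundMillis, long maxClockSkewMillis) {
    if (stalenessBoundMillis < 0 || maxClockSkewMillis < 0) {
      throw new IllegalArgumentException("bound and skew must be >= 0");
    }
    return Math.max(0L, stalenessBoundMillis - maxClockSkewMillis);
  }
}
```

With a 100ms bound and an assumed 30ms maximum skew, the follower would only serve reads whose measured age is at most 70ms. If the skew is unbounded, no such margin exists, which is the tradeoff noted above.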