[
https://issues.apache.org/jira/browse/RATIS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2376:
-------------------------------
Description:
This is an idea that came up while discussing how to improve follower reads by
guaranteeing bounded stale reads based on time.
In HDDS-13954, Ozone supports a LocalLease feature which uses two checks
(sketched below):
# The elapsed time since the leader's last RPC, to determine whether the local
lease has expired because the leader has not been reachable for a certain time
# The difference between the leader's commit index and the follower's applied
index (both configurable), so that the data returned is not too stale.
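A minimal sketch of the two checks, using illustrative names rather than the
actual HDDS-13954 code:
{code:java}
// Hypothetical sketch of the two LocalLease checks; all names are illustrative.
static boolean canServeLocalRead(long nowMs, long lastLeaderRpcMs, long leaseExpiryMs,
    long leaderCommitIndex, long followerAppliedIndex, long maxIndexLag) {
  // (1) Lease check: the leader must have been heard from recently enough.
  final boolean leaseValid = nowMs - lastLeaderRpcMs < leaseExpiryMs;
  // (2) Staleness check: the follower's applied index must not lag the
  //     leader's commit index by more than the configured bound.
  final boolean notTooStale = leaderCommitIndex - followerAppliedIndex <= maxIndexLag;
  return leaseValid && notTooStale;
}
{code}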
The issue is that the index-difference check in (2) has a few problems:
# An index difference is not user-friendly: users do not know how an index
difference translates to data staleness (see the illustration below). For
example, suppose we set the index difference to 10,000. In an idle cluster
(1,000 transactions/hour), the follower can return data that is 10 hours stale,
while in a busy cluster (100,000 transactions/hour) the follower returns data
that is at most 6 minutes (60 minutes / 10) stale. There is no mechanism that
lets a user specify, for example, that reads must be at most 5 minutes stale.
# The leader's commit index known to the follower is only updated by
AppendEntries. Therefore, if the leader's AppendEntries RPC is late, the commit
index will not be updated, causing the follower to conclude that it is too slow
and to fail the reads, even though it is actually caught up.
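To make the first problem concrete, the same index bound yields very different
time bounds depending on write throughput (illustrative numbers only):
{code:java}
// Illustrative only: how a fixed index-difference bound maps to time staleness.
static double stalenessHours(long maxIndexLag, double txPerHour) {
  return maxIndexLag / txPerHour;
}
// stalenessHours(10_000,   1_000) == 10.0 -> up to 10 hours stale (idle cluster)
// stalenessHours(10_000, 100_000) ==  0.1 -> up to 6 minutes stale (busy cluster)
{code}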
One idea is to use timestamp differences to set bounds on how stale a read can
be:
- The client can configure a max staleness bound (e.g. 100ms means the user may
read data that is at most 100ms old)
-- 0 means the read will wait until the follower's timestamp matches the
leader's timestamp.
-- This provides a more user-friendly guarantee (a timestamp instead of an
index difference)
- Implementation
-- Each Ratis log entry now carries a timestamp from the leader of its term,
which the follower will use to check the bound
-- When a read request arrives at the follower, the follower checks the last
applied log entry's leader timestamp and only returns once the difference
between the current timestamp and that leader timestamp is within the staleness
bound (see the sketch after this list)
--- If it is not, the read is either blocked (not recommended) or requeued
(recommended) on the RPC queue
-- In an idle cluster with few writes, the last applied timestamp will not
advance, so bounded stale reads might block indefinitely
--- A few approaches to handle this:
---- Set up a side channel to propagate the leader's timestamp
---- Fall back to a linearizable read
----- The argument is that ReadIndex performance should be good when there are
no contending writes, because the follower does not need to wait: its applied
index should already have reached the read index
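A minimal sketch of the follower-side decision, assuming the leader timestamp
is tracked with the last applied entry (all names are hypothetical):
{code:java}
// Hypothetical follower-side check for bounded stale reads.
enum ReadDecision { SERVE, REQUEUE, FALLBACK_TO_LINEARIZABLE }

static ReadDecision checkBoundedStaleRead(long nowMs, long lastAppliedLeaderTimestampMs,
    long maxStalenessMs, long requestDeadlineMs) {
  final long staleness = nowMs - lastAppliedLeaderTimestampMs;
  if (staleness <= maxStalenessMs) {
    // The last applied entry is recent enough: serve the read locally.
    return ReadDecision.SERVE;
  }
  if (nowMs < requestDeadlineMs) {
    // Not yet within the bound: requeue instead of blocking the RPC thread.
    return ReadDecision.REQUEUE;
  }
  // The applied timestamp may never advance in an idle cluster, so rather
  // than wait forever, fall back to a linearizable (ReadIndex) read.
  return ReadDecision.FALLBACK_TO_LINEARIZABLE;
}
{code}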
Afterwards, we can add a queryStale(Message request, long maxStalenessMs, long
lastSeenIndex, long lastSeenTimestamp) API to implement this.
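Hypothetical client usage of the proposed API (queryStale does not exist in
Ratis yet, and the -1 sentinel for "no floor" is an assumption):
{code:java}
// Ask a follower for data at most 100ms stale, with no monotonic-read floor.
Message request = Message.valueOf("read:key-a");
RaftClientReply reply = client.io().queryStale(
    request,
    100L,  // maxStalenessMs
    -1L,   // lastSeenIndex (assumed sentinel: no floor)
    -1L);  // lastSeenTimestamp (assumed sentinel: no floor)
{code}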
Consistency guarantees
* Configurable bounded stale reads
* Consistent prefix / snapshot isolation (if specified): the ordering of
transactions is still guaranteed (e.g. if key a is created and then key b is
created on the leader, the follower sees the same ordering) due to Raft's
sequential nature
* Monotonic reads (if a last seen index or timestamp is specified): a single
client sees the data as monotonically increasing (see the sketch after this
list)
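A sketch of how a client could maintain the monotonic-read floor across calls;
queryStale and the reply accessors used below are the proposed API, not
existing Ratis methods:
{code:java}
import java.io.IOException;
import org.apache.ratis.client.RaftClient;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;

// Hypothetical client-side session that remembers the highest index and
// timestamp observed, so a follower never serves this client older state.
class StaleReadSession {
  private long lastSeenIndex = -1;
  private long lastSeenTimestamp = -1;

  Message read(RaftClient client, Message request, long maxStalenessMs)
      throws IOException {
    final RaftClientReply reply = client.io().queryStale(
        request, maxStalenessMs, lastSeenIndex, lastSeenTimestamp);
    // Assumed: the reply exposes the applied index and the leader timestamp
    // at which the read was served.
    lastSeenIndex = Math.max(lastSeenIndex, reply.getLogIndex());
    lastSeenTimestamp = Math.max(lastSeenTimestamp, reply.getLeaderTimestamp());
    return reply.getMessage();
  }
}
{code}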
Advantages
* This removes the one RTT needed by the ReadIndex algorithm, which improves
latency
* In a multi-region setup, we can route read requests to the nearest follower
instead of the leader (and without the follower contacting the leader on each
call)
However, this approach has some tradeoffs
* We need to ensure that the clock drift between nodes is bounded (at least
most of the time)
** AFAIK, unless we have specialized hardware like Google TrueTime, this cannot
currently be guaranteed
* Adding a timestamp to every Raft log entry may add some space and memory
overhead
Alternative implementations
* CockroachDB:
[https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210519_bounded_staleness_reads.md]
> Support bounded stale reads
> ---------------------------
>
> Key: RATIS-2376
> URL: https://issues.apache.org/jira/browse/RATIS-2376
> Project: Ratis
> Issue Type: New Feature
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)