[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058105#comment-18058105
]
Ivan Andika edited comment on RATIS-2403 at 2/13/26 10:01 AM:
--------------------------------------------------------------
[~tanxinyu] Thanks for the feedback and ideas.
FYI my current benchmark setup:
* Set up the baseline (leader-only reads and writes)
* Each benchmark is set up with one of the following write/read workload mixes,
ranging from write-only to read-only:
** 100% Write
** 100% Read
** 10% Write, 90% Read
** 30% Write, 70% Read
** 90% Write, 10% Read
* There are 100 client threads, with one of the following target-selection
configurations (a rough sketch of both policies follows this list):
** Random: Each client thread picks a random node (can be leader or follower)
** Follower only: Each client thread picks only followers
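For clarity, here is a minimal sketch of the two target-selection policies. The class, enum, and method names are my own illustration for this comment, not the actual benchmark code:
{code:java}
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;
import org.apache.ratis.protocol.RaftPeerId;

// Hypothetical sketch only: how each benchmark client thread could pick its target peer.
public class TargetSelection {
  enum Policy { RANDOM, FOLLOWER_ONLY }

  static RaftPeerId pickTarget(List<RaftPeerId> peers, RaftPeerId leader, Policy policy) {
    List<RaftPeerId> candidates = policy == Policy.FOLLOWER_ONLY
        ? peers.stream().filter(p -> !p.equals(leader)).collect(Collectors.toList())
        : peers;  // RANDOM may also pick the leader
    return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
  }
}
{code}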
Regarding high pressure or saturation: currently Ozone Manager is not able to
hit the physical resource limits (CPU, I/O, network) since it is protected by
backpressure mechanisms such as the RPC queue and RPC handlers, and by
synchronization mechanisms such as the key lock. However, when follower reads
are enabled, even with separate RPC queues and RPC handlers and less lock
contention (since OM nodes do not share locks), the read throughput suffers
quite a bit. When I throttled the write requests, the read throughput improved
dramatically. I understand that if we push the leader to its limit, offloading
any additional load to the followers should let the overall Raft group handle
more throughput. Nonetheless, we also want to ensure that there is no
throughput degradation in the normal case.
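For reference, the write throttling I tried boils down to a fixed-rate limiter in front of the write path. A minimal sketch under that assumption (Guava RateLimiter, arbitrary rate, not the actual patch):
{code:java}
import com.google.common.util.concurrent.RateLimiter;

// Minimal sketch of throttling writes so the commit/read index is not pushed forward
// faster than follower reads can catch up. The 5000 ops/s figure is an arbitrary
// illustration, not a recommended or measured value.
public class WriteThrottler {
  private final RateLimiter writeLimiter = RateLimiter.create(5000.0);

  public void beforeWrite() {
    // Blocks the calling handler thread until a permit is available. A static rate
    // like this is exactly why the approach can regress when the workload changes.
    writeLimiter.acquire();
  }
}
{code}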
Regarding rate limiting: Ozone currently follows the Hadoop FairCallQueue
implementation
([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FairCallQueue.html])
where each user's requests are weighted based on factors such as how long they
hold the lock, etc. The user is then deprioritized into a lower queue so that,
for example, for every 1 request served from the lower queue, 2 requests are
served from the higher queue. I tried rate limiting writes and it does yield a
good improvement in the read results (while writes now degrade), but the issue
is that my current rate limiting is not flexible and might regress if the
workload changes (e.g. more writes).
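For context, enabling FairCallQueue with the 2:1 weighting described above is just standard Hadoop IPC configuration. A hedged sketch of what that could look like (the OM RPC port 9862 and the exact values are assumptions for illustration):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: standard Hadoop FairCallQueue keys, keyed by the RPC port (9862 assumed
// for the OM here). With weights "2,1", the multiplexer serves 2 requests from the
// higher-priority queue for every 1 request from the lower-priority queue.
public class FairCallQueueExample {
  static Configuration fairCallQueueConf() {
    Configuration conf = new Configuration();
    conf.set("ipc.9862.callqueue.impl", "org.apache.hadoop.ipc.FairCallQueue");
    conf.set("ipc.9862.scheduler.impl", "org.apache.hadoop.ipc.DecayRpcScheduler");
    conf.setInt("ipc.9862.scheduler.priority.levels", 2);
    conf.set("ipc.9862.faircallqueue.multiplexer.weights", "2,1");
    return conf;
  }
}
{code}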
Let me try to replicate your methodology to see if it can uncover other
bottlenecks.
[~tanxinyu] Btw, can I check whether linearizable follower read (with / without
lease) has been widely used in production for IoTDB? If it has, that means the
implementation is already production-ready and the bottleneck might be on the
Ozone side. It would be great if you have blogs or links for benchmarks against
leader-only workloads so we can see the expected speedup (currently my target
is a 1.5x-2x read throughput increase with no write throughput degradation).
> Improve linearizable follower read throughput instead of writes
> ---------------------------------------------------------------
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Priority: Major
> Attachments: leader-backpressure.patch
>
>
> While benchmarking linearizable follower read, the observation is that the
> more requests go to the followers instead of the leader, the better the write
> throughput becomes; we saw around a 2-3x write throughput increase compared to
> leader-only write and read (most likely due to less leader resource
> contention). However, the read throughput becomes worse than leader-only
> write and read (some results are below 0.2x). Even with optimizations such as
> RATIS-2392, RATIS-2382 ([https://github.com/apache/ratis/pull/1334]), and RATIS-2379,
> the read throughput remains worse than the leader-only case (these changes even
> improve write performance instead of read performance).
> I suspect that because write throughput increases, the read index advances at
> a faster rate, which causes linearizable follower reads to wait longer.
> The target is to improve read throughput to 1.5x - 2x of the leader-only
> write-and-read baseline. Currently, a pure-read workload (no writes) improves
> read throughput by up to 1.7x, but total follower read throughput is far below
> this target.
> Currently my ideas are:
> * Sacrificing writes for reads: can we limit the write QPS so that the read QPS
> can increase?
> ** From the benchmarks, read throughput only improves when write
> throughput is lower.
> ** We can try to use a backpressure mechanism so that writes do not advance so
> quickly that read throughput suffers.
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader to
> stall if a follower is down for a while (e.g. restarted), which violates the
> majority availability guarantee. It's also hard to know which value is
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)