[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058105#comment-18058105
]
Ivan Andika edited comment on RATIS-2403 at 2/13/26 10:01 AM:
--------------------------------------------------------------
[~tanxinyu] Thanks for the feedback and ideas.
FYI my current benchmark setup:
* Set up the baseline (leader-only reads and writes)
* Each benchmark is set up with one of the following write/read workload mixes,
ranging from write-only to read-only:
** 100% Write
** 100% Read
** 10% Write, 90% Read
** 30% Write, 70% Read
** 90% Write, 10% Read
* There are 100 client threads, with one of the following target-selection
configurations (a rough sketch of both policies follows this list):
** Random: Each client thread picks a random node (can be leader or follower)
** Follower only: Each client thread picks only followers
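For clarity, here is a minimal sketch of the two target-selection policies. The class, enum, and method names are my own illustration for this comment, not the actual benchmark code:
{code:java}
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;
import org.apache.ratis.protocol.RaftPeerId;

// Hypothetical sketch only: how each benchmark client thread could pick its target peer.
public class TargetSelection {
  enum Policy { RANDOM, FOLLOWER_ONLY }

  static RaftPeerId pickTarget(List<RaftPeerId> peers, RaftPeerId leader, Policy policy) {
    List<RaftPeerId> candidates = policy == Policy.FOLLOWER_ONLY
        ? peers.stream().filter(p -> !p.equals(leader)).collect(Collectors.toList())
        : peers;  // RANDOM may also pick the leader
    return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
  }
}
{code}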
Regarding high pressure or saturation: currently Ozone Manager is not able to
hit the physical resource limits (CPU, I/O, network) since it is protected by
backpressure mechanisms such as the RPC queue and RPC handlers, and by
synchronization mechanisms such as the key lock. However, when follower reads
are enabled, even with separate RPC queues and RPC handlers and less lock
contention (since OM nodes do not share locks), the read throughput suffers
quite a bit. When I throttled the write requests, the read throughput improved
dramatically. I understand that if we push the leader to its limit, offloading
any additional load to the followers should let the overall Raft group handle
more throughput. Nonetheless, we also want to ensure that there is no
throughput degradation in the normal case.
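For reference, the write throttling I tried boils down to a fixed-rate limiter in front of the write path. A minimal sketch under that assumption (Guava RateLimiter, arbitrary rate, not the actual patch):
{code:java}
import com.google.common.util.concurrent.RateLimiter;

// Minimal sketch of throttling writes so the commit/read index is not pushed forward
// faster than follower reads can catch up. The 5000 ops/s figure is an arbitrary
// illustration, not a recommended or measured value.
public class WriteThrottler {
  private final RateLimiter writeLimiter = RateLimiter.create(5000.0);

  public void beforeWrite() {
    // Blocks the calling handler thread until a permit is available. A static rate
    // like this is exactly why the approach can regress when the workload changes.
    writeLimiter.acquire();
  }
}
{code}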
Regarding rate limiting: Ozone currently follows the Hadoop FairCallQueue
implementation
([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FairCallQueue.html])
where each user's requests are weighted based on factors such as how long they
hold the lock, etc. The user is then deprioritized into a lower queue so that,
for example, for every 1 request served from the lower queue, 2 requests are
served from the higher queue. I tried rate limiting writes and it does yield a
good improvement in the read results (while writes now degrade), but the issue
is that my current rate limiting is not flexible and might regress if the
workload changes (e.g. more writes).
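For context, enabling FairCallQueue with the 2:1 weighting described above is just standard Hadoop IPC configuration. A hedged sketch of what that could look like (the OM RPC port 9862 and the exact values are assumptions for illustration):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: standard Hadoop FairCallQueue keys, keyed by the RPC port (9862 assumed
// for the OM here). With weights "2,1", the multiplexer serves 2 requests from the
// higher-priority queue for every 1 request from the lower-priority queue.
public class FairCallQueueExample {
  static Configuration fairCallQueueConf() {
    Configuration conf = new Configuration();
    conf.set("ipc.9862.callqueue.impl", "org.apache.hadoop.ipc.FairCallQueue");
    conf.set("ipc.9862.scheduler.impl", "org.apache.hadoop.ipc.DecayRpcScheduler");
    conf.setInt("ipc.9862.scheduler.priority.levels", 2);
    conf.set("ipc.9862.faircallqueue.multiplexer.weights", "2,1");
    return conf;
  }
}
{code}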
Let me try to replicate your methodology to see if it can uncover other
bottlenecks.
[~tanxinyu] Btw, can I check whether linearizable follower read (with / without
lease) has been widely used in production for IoTDB? If it has, that means the
implementation is already production-ready and the bottleneck might be on the
Ozone side. It would be great if you have blogs or links for benchmarks against
leader-only workloads so we can see the expected speedup (currently my target
is a 1.5x-2x read throughput increase with no write throughput degradation).
> Improve linearizable follower read throughput instead of writes
> ---------------------------------------------------------------
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Priority: Major
> Attachments: leader-backpressure.patch
>
>
> While benchmarking linearizable follower read, the observation is that the
> more requests go to the followers instead of the leader, the better the write
> throughput becomes; we saw around a 2-3x write throughput increase compared to
> leader-only write and read (most likely due to less leader resource
> contention). However, the read throughput becomes worse than leader-only
> write and read (some results are below 0.2x). Even with optimizations such as
> RATIS-2392, RATIS-2382 ([https://github.com/apache/ratis/pull/1334]), and RATIS-2379,
> the read throughput remains worse than the leader-only case (these changes even
> improve write performance instead of read performance).
> I suspect that because write throughput increases, the read index advances at
> a faster rate, which causes linearizable follower reads to wait longer.
> The target is to improve read throughput to 1.5x - 2x of the leader-only
> write-and-read baseline. Currently, a pure-read workload (no writes) improves
> read throughput by up to 1.7x, but total follower read throughput is far below
> this target.
> Currently my ideas are:
> * Sacrificing writes for reads: can we limit the write QPS so that the read QPS
> can increase?
> ** From the benchmarks, read throughput only improves when write
> throughput is lower.
> ** We can try to use a backpressure mechanism so that writes do not advance so
> quickly that read throughput suffers.
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader to
> stall if a follower is down for a while (e.g. restarted), which violates the
> majority availability guarantee. It's also hard to know which value is
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)