[
https://issues.apache.org/jira/browse/RATIS-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2392:
-------------------------------
Description:
This issue was found while debugging a slow {{TestOzoneShellHAWithFollowerRead}}
(it ran for as long as 10 minutes, whereas {{TestOzoneShellHA}} runs for only
2 minutes). The latency of
{{OzoneManagerProtocolServerSideTranslatorPB#submitReadRequestToOM}} was
observed to be around 500ms for some read requests, which is unacceptably long
and exceeds disk latency. Since the test runs locally, high ReadIndex network
latency can be ruled out.
After a long investigation, the main latency turned out to be in the follower's
{{ReadRequests#waitForAdvance}}. The underlying follower bottleneck, however, is
in {{StateMachineUpdater#waitForCommit}}, not in any of the earlier hypotheses:
1) slow follower {{StateMachine#applyTransactions}}, 2) the {{ReadIndex}}
network communication, or 3) the leader's {{ReadIndex}} latency (which should
already be solved by RATIS-2379 and RATIS-2382).
From the debug logs, the root cause is that the follower has not yet seen the
leader's latest commitIndex (e.g. the leader's commitIndex is 10, but the
follower's commitIndex is 9), so the follower cannot advance its commitIndex
and apply transactions up to the higher commitIndex (see
{{StateMachineUpdater#waitForCommit}}). The follower is therefore stuck
waiting in {{StateMachineUpdater#waitForCommit}} until it receives an
AppendEntries from the leader with leaderCommit >= readIndex. The leader's
commitIndex is only included in {{AppendEntries}}.
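The wait condition described above can be illustrated with a small
self-contained Java model. This is only a sketch, not Ratis code: the class
{{FollowerCommitModel}} and its methods are hypothetical names, and it assumes
the follower adopts leaderCommit directly (a real Raft follower also bounds it
by the index of its last log entry).

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

/** Hypothetical model of a follower's commit wait; not the Ratis implementation. */
class FollowerCommitModel {
  private long commitIndex;
  // Pending reads, keyed by the readIndex each one is waiting for.
  private final TreeMap<Long, CompletableFuture<Long>> pending = new TreeMap<>();

  FollowerCommitModel(long initialCommitIndex) {
    this.commitIndex = initialCommitIndex;
  }

  /** Returns a future that completes once commitIndex >= readIndex. */
  synchronized CompletableFuture<Long> waitForCommit(long readIndex) {
    if (commitIndex >= readIndex) {
      return CompletableFuture.completedFuture(commitIndex);
    }
    return pending.computeIfAbsent(readIndex, k -> new CompletableFuture<>());
  }

  /** Models receiving an AppendEntries (or heartbeat) that carries leaderCommit. */
  synchronized void onAppendEntries(long leaderCommit) {
    if (leaderCommit <= commitIndex) {
      return; // nothing new to learn, pending reads keep waiting
    }
    commitIndex = leaderCommit; // assumption: the log already holds these entries
    // Release every read whose readIndex is now covered by commitIndex.
    Map<Long, CompletableFuture<Long>> ready = pending.headMap(commitIndex, true);
    ready.values().forEach(f -> f.complete(commitIndex));
    ready.clear();
  }
}
```

With the example from the logs, a follower at commitIndex 9 that asks to read
at readIndex 10 stays blocked until some AppendEntries with leaderCommit >= 10
arrives, no matter how quickly the ReadIndex reply itself came back.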
One solution is to trigger a heartbeat / AppendEntries to the follower
immediately after the ReadIndex is returned. I previously also considered
allowing an {{AppendEntriesRequestProto}} to be attached to the
{{ReadIndexReplyProto}} to reduce the number of RPC calls, but this can cause
subtle bugs and increase latency further (the follower needs to process and
reply to the AppendEntries; otherwise the leader has to keep resending it).
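The proposed leader-side behavior can be sketched as follows. This is a
hypothetical model rather than the actual patch: {{LeaderReadIndexModel}},
{{Heartbeat}} and {{handleReadIndex}} are illustrative names, the leadership
confirmation that a real ReadIndex requires is omitted, and "sending" is
modeled by recording the heartbeat in a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical leader-side model; names are illustrative, not Ratis APIs. */
final class LeaderReadIndexModel {
  /** One recorded AppendEntries (heartbeat) addressed to a follower. */
  static final class Heartbeat {
    final String followerId;
    final long leaderCommit;

    Heartbeat(String followerId, long leaderCommit) {
      this.followerId = followerId;
      this.leaderCommit = leaderCommit;
    }
  }

  private final long commitIndex;
  private final List<Heartbeat> sent = new ArrayList<>();

  LeaderReadIndexModel(long commitIndex) {
    this.commitIndex = commitIndex;
  }

  /** Answers a ReadIndex request and queues a heartbeat for the requester. */
  long handleReadIndex(String followerId) {
    long readIndex = commitIndex;
    // The heartbeat stays a separate AppendEntries RPC instead of being
    // embedded in the ReadIndex reply; it carries leaderCommit >= readIndex.
    sent.add(new Heartbeat(followerId, commitIndex));
    return readIndex;
  }

  List<Heartbeat> sentHeartbeats() {
    return sent;
  }
}
```

Keeping the heartbeat as an ordinary AppendEntries means the existing
reply and retry handling applies unchanged, which is the concern raised above
about embedding it in the ReadIndex reply.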
After the improvement, the test time drops from 10 minutes to 2 minutes (similar
to {{TestOzoneShellHA}}). However, when I benchmarked the performance, there
was no significant improvement. I suspect the improvement is largest when the
Ratis group is not busy (i.e. there are not many AppendEntries), since
otherwise one of those AppendEntries will carry the leaderCommit anyway.
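The suspicion above can be made concrete with a toy timing model. It is an
assumption-laden sketch, not a measurement: it supposes that AppendEntries
reach the follower at fixed intervals, that each one carries a sufficient
leaderCommit, and that the triggered heartbeat arrives a fixed delay after the
ReadIndex reply. {{ReadWaitModel}} and all numbers used with it are
illustrative.

```java
/** Toy model of how long a follower read waits for a usable leaderCommit. */
final class ReadWaitModel {
  private ReadWaitModel() {
  }

  /** Wait until the next periodic AppendEntries, which arrive at k * intervalMs. */
  static long waitWithoutTrigger(long replyAtMs, long intervalMs) {
    if (intervalMs <= 0) {
      throw new IllegalArgumentException("intervalMs must be positive");
    }
    long remainder = replyAtMs % intervalMs;
    return remainder == 0 ? 0 : intervalMs - remainder;
  }

  /** Wait when the leader also sends a heartbeat that lands triggerDelayMs after the reply. */
  static long waitWithTrigger(long replyAtMs, long intervalMs, long triggerDelayMs) {
    return Math.min(triggerDelayMs, waitWithoutTrigger(replyAtMs, intervalMs));
  }
}
```

In this model an idle group with an AppendEntries every 500 ms leaves a read
whose reply arrives at t = 10 ms waiting 490 ms, or 1 ms with a triggered
heartbeat that takes 1 ms. With an AppendEntries every 2 ms the wait is at
most 1 ms either way, which would be consistent with a benchmark on a busy
group showing little difference.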
was:
This issue was found while debugging a slow {{TestOzoneShellHAWithFollowerRead}}
(it ran for as long as 10 minutes, whereas {{TestOzoneShellHA}} runs for only
2 minutes). The latency of
{{OzoneManagerProtocolServerSideTranslatorPB#submitReadRequestToOM}} was
observed to be around 500ms for some read requests, which is unacceptably long
and exceeds disk latency. Since the test runs locally, high ReadIndex network
latency can be ruled out.
After a long investigation, the main latency turned out to be in the follower's
{{ReadRequests#waitForAdvance}}. The underlying follower bottleneck, however, is
in {{StateMachineUpdater#waitForCommit}}, not in any of the earlier hypotheses:
1) slow follower {{StateMachine#applyTransactions}}, 2) the {{ReadIndex}}
network communication, or 3) the leader's {{ReadIndex}} latency (which should
already be solved by RATIS-2379 and RATIS-2382).
From the debug logs, the root cause is that the follower has not yet seen the
leader's latest commitIndex (e.g. the leader's commitIndex is 10, but the
follower's commitIndex is 9), so the follower cannot advance its commitIndex
and apply transactions up to the higher commitIndex (see
{{StateMachineUpdater#waitForCommit}}). The follower is therefore stuck
waiting in {{StateMachineUpdater#waitForCommit}} until it receives an
AppendEntries from the leader with leaderCommit >= readIndex. The leader's
commitIndex is only included in {{AppendEntries}}.
One solution is to trigger a heartbeat / AppendEntries to the follower
immediately after the ReadIndex is returned. I previously also considered
allowing an {{AppendEntriesRequestProto}} to be attached to the
{{ReadIndexReplyProto}} to reduce the number of RPC calls, but this can cause
subtle bugs and increase latency further (the follower needs to process and
reply to the AppendEntries; otherwise the leader has to keep resending it).
After the improvement, the test time drops from 10 minutes to 2 minutes (similar
to {{TestOzoneShellHA}}). However, I suspect the improvement is largest when
the Ratis group is not busy (i.e. there are not many AppendEntries), since
otherwise one of those AppendEntries will carry the leaderCommit anyway.
> Leader should trigger heartbeat immediately after ReadIndex
> -----------------------------------------------------------
>
> Key: RATIS-2392
> URL: https://issues.apache.org/jira/browse/RATIS-2392
> Project: Ratis
> Issue Type: Improvement
> Components: Linearizable Read, performance
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: image-2026-02-04-17-01-22-314.png,
> image-2026-02-04-17-01-50-676.png, image-2026-02-04-17-02-15-168.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)