[ https://issues.apache.org/jira/browse/RATIS-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Andika updated RATIS-2156:
-------------------------------
Description:
Currently, StateMachine.LeaderEventApi#notifyFollowerSlowness is triggered
based on raft.server.rpc.slowness.timeout. We have seen cases where the RPC
RTT between the leader and a follower does not exceed the timeout, yet the
gap between the leader's and the follower's log indexes keeps increasing,
i.e. the slow follower cannot catch up.
In Ozone, this causes most watch requests with ALL_COMMITTED replication to
time out, which increases write latency. It would be better to close the
pipeline if the slow follower cannot catch up.
!image-2024-09-13-18-54-04-203.png|width=1408,height=244!
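A minimal sketch of what a log-index-based check could look like, for
illustration only. The class name LogIndexLagDetector, the maxLag threshold
and the "consecutive checks" rule are assumptions made for this sketch; they
are not Ratis APIs or configuration keys, and the issue does not specify how
the gap would be measured or configured.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Ratis): flags a follower as slow when the
 * gap between the leader's last log index and the follower's match index
 * stays above a threshold for several consecutive checks.
 */
final class LogIndexLagDetector {
  private final long maxLag;        // assumed threshold, in log entries
  private final int maxConsecutive; // assumed number of consecutive violations
  private final Map<String, Integer> strikes = new HashMap<>();

  LogIndexLagDetector(long maxLag, int maxConsecutive) {
    if (maxLag < 0 || maxConsecutive < 1) {
      throw new IllegalArgumentException("invalid thresholds");
    }
    this.maxLag = maxLag;
    this.maxConsecutive = maxConsecutive;
  }

  /**
   * Records one observation for a follower.
   *
   * @return true if the follower has exceeded maxLag for at least
   *         maxConsecutive consecutive calls
   */
  boolean check(String followerId, long leaderLastIndex, long followerMatchIndex) {
    final long lag = leaderLastIndex - followerMatchIndex;
    if (lag <= maxLag) {
      // The follower is close enough to the leader: forget past violations.
      strikes.remove(followerId);
      return false;
    }
    // Count this violation and compare against the allowed streak length.
    final int count = strikes.merge(followerId, 1, Integer::sum);
    return count >= maxConsecutive;
  }
}
```

Under these assumptions, a leader could call check(...) periodically for each
follower and, when it returns true, invoke notifyFollowerSlowness so that the
application (e.g. Ozone) can decide to close the pipeline. Requiring several
consecutive violations is one way to avoid reacting to a short burst of writes.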
> Notify follower slowness based on the log index
> -----------------------------------------------
>
> Key: RATIS-2156
> URL: https://issues.apache.org/jira/browse/RATIS-2156
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: image-2024-09-13-18-54-04-203.png
>