Ivan Andika created HDDS-13819:
----------------------------------
Summary: Add a short wait before starting service after Ratis leadership change
Key: HDDS-13819
URL: https://issues.apache.org/jira/browse/HDDS-13819
Project: Apache Ozone
Issue Type: Improvement
Reporter: Ivan Andika
Assignee: Ivan Andika
Currently, isLeaderReady is used to check whether an internal service should be
started, to prevent multiple instances of a service from running at the same
time. When the Ratis group is working normally (no network partitions, etc.),
this check should be sufficient since there should be only one leader.
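The gating described above can be sketched as follows. This is a minimal illustration, not Ozone code: the class LeaderGatedTask is hypothetical, and the BooleanSupplier stands in for the node's isLeaderReady() check. The key point is that the check reflects only this node's local view of leadership, so two nodes that both believe they are leader will both pass it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch of a leader-only background task. The BooleanSupplier
 * stands in for an isLeaderReady() check; it is not the real Ozone API.
 */
public class LeaderGatedTask {
  private final BooleanSupplier isLeaderReady;
  private final AtomicInteger runs = new AtomicInteger();

  public LeaderGatedTask(BooleanSupplier isLeaderReady) {
    this.isLeaderReady = isLeaderReady;
  }

  /** One scheduled iteration: skip unless this node believes it is the ready leader. */
  public boolean runOnce() {
    if (!isLeaderReady.getAsBoolean()) {
      return false; // follower or not-yet-ready leader: do nothing
    }
    runs.incrementAndGet(); // placeholder for work that mutates OM/SCM state
    return true;
  }

  public int runs() {
    return runs.get();
  }

  public static void main(String[] args) {
    boolean[] leader = {true};
    LeaderGatedTask task = new LeaderGatedTask(() -> leader[0]);
    task.runOnce();                  // runs: node believes it is leader
    leader[0] = false;
    task.runOnce();                  // skipped: node no longer leader
    System.out.println(task.runs()); // prints 1
  }
}
```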
However, there may be a small window (within raft.server.rpc.timeout.max,
which defaults to 300ms) during which two OM or SCM nodes each believe they are
the leader, before the stale leader steps down with LOST_MAJORITY_HEARTBEATS.
During this period, two instances of a service might run at the same time, both
updating the OM / SCM state.
One option is to add a short sleep before starting the service and to check
leadership again before each service run. Additionally, we should interrupt
the background service if there is a leadership change.
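A minimal sketch of the proposed mitigation, assuming a grace period at least as long as raft.server.rpc.timeout.max: check leadership, wait, re-check, then run, and cancel the in-flight run on a leadership-change notification. GuardedLeaderService, tryRun, and onLeadershipLost are hypothetical names for illustration only; how Ozone would receive the leadership-change callback is not specified here. A timing-based wait narrows the dual-leader window rather than strictly eliminating it, and the interrupt only helps if the work itself honors thread interruption.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; class and method names are illustrative, not Ozone APIs. */
public class GuardedLeaderService {
  private final BooleanSupplier isLeaderReady;
  // Assumption: chosen >= raft.server.rpc.timeout.max (300ms by default).
  private final Duration grace;
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private volatile Future<?> current;

  public GuardedLeaderService(BooleanSupplier isLeaderReady, Duration grace) {
    this.isLeaderReady = isLeaderReady;
    this.grace = grace;
  }

  /** Waits out the possible dual-leader window, then re-checks before running. */
  public boolean tryRun(Runnable work) {
    if (!isLeaderReady.getAsBoolean()) {
      return false;
    }
    try {
      Thread.sleep(grace.toMillis()); // a stale leader should step down meanwhile
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve interrupt status
      return false;
    }
    if (!isLeaderReady.getAsBoolean()) {
      return false; // leadership lost while waiting: skip this run
    }
    current = executor.submit(work);
    return true;
  }

  /** Called on a leadership-change notification: interrupt any in-flight run. */
  public void onLeadershipLost() {
    Future<?> f = current;
    if (f != null) {
      f.cancel(true); // interrupts the worker thread if the run is in progress
    }
  }

  public Future<?> lastRun() {
    return current;
  }

  public void shutdown() {
    executor.shutdownNow();
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Leader on the first check, not leader on the re-check (simulated step-down).
    GuardedLeaderService svc =
        new GuardedLeaderService(() -> calls[0]++ == 0, Duration.ofMillis(50));
    System.out.println(svc.tryRun(() -> System.out.println("ran"))); // prints false
    svc.shutdown();
  }
}
```

The re-check after the sleep is what catches a node that stepped down during the wait; the cancel path covers a leadership change that arrives after a run has already started.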
This is only one instance; it can be expanded into a story to review and
consolidate the consistency guarantees of OM and SCM background services.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)