[ 
https://issues.apache.org/jira/browse/HDDS-13819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-13819:
-------------------------------
    Description: 
Currently, isLeaderReady is used to check whether an internal service should 
be started, to prevent multiple instances of the service from running at the 
same time. When the Ratis group is working normally (no network partitions, 
etc.), this check should be fine since there should be only one leader.

However, there might be a small window (within raft.server.rpc.timeout.max, 
which defaults to 300ms) in which two OM or SCM nodes each believe they are 
the leader, before one steps down with LOST_MAJORITY_HEARTBEATS. During this 
period, two services might be running at the same time, and both can update 
the OM / SCM state.

One way to address this is to add a short sleep before starting the service 
and to check the leadership again before each service run. Additionally, we 
should also interrupt the background service if there is a leadership change.
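
A minimal sketch of the idea in Java, under stated assumptions: the class 
LeaderGatedRunner, the settleMillis parameter and the onLeadershipLost hook 
are hypothetical names for illustration and are not existing Ozone or Ratis 
APIs. The BooleanSupplier stands in for the isLeaderReady check. The wait 
only narrows the window described above; it is not a fencing mechanism.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of Ozone: gates one service run on leadership. */
final class LeaderGatedRunner {
  private final BooleanSupplier isLeaderReady; // stand-in for the isLeaderReady check
  private final long settleMillis;             // assumed >= raft.server.rpc.timeout.max
  private final Runnable task;                 // one run of the background service
  private volatile Thread worker;              // thread currently executing the task

  LeaderGatedRunner(BooleanSupplier isLeaderReady, long settleMillis, Runnable task) {
    this.isLeaderReady = isLeaderReady;
    this.settleMillis = settleMillis;
    this.task = task;
  }

  /** Runs the task once if leadership holds before and after the wait. */
  boolean runOnce() {
    if (!isLeaderReady.getAsBoolean()) {
      return false;
    }
    try {
      // Give a stale leader time to notice it lost its majority and step down.
      Thread.sleep(settleMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
    // Check leadership again right before the run.
    if (!isLeaderReady.getAsBoolean()) {
      return false;
    }
    worker = Thread.currentThread();
    try {
      task.run();
      return true;
    } finally {
      worker = null;
    }
  }

  /** To be called on a leadership change; interrupts an in-flight run, if any. */
  void onLeadershipLost() {
    Thread t = worker;
    if (t != null) {
      t.interrupt();
    }
  }
}
```

The interrupt only helps if the task itself checks the interrupt flag or 
blocks in interruptible calls, so each service run would need to be written 
with that in mind.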

This is only one instance; this can be expanded into a story to review and 
consolidate the consistency guarantees of OM and SCM background services. 
Ideally, there should be only one OM and one SCM background service that can 
update the OM and SCM states or send datanode commands.

  was:
Currently, isLeaderReady is used to check whether an internal service should 
be started, to prevent multiple instances of the service from running at the 
same time. When the Ratis group is working normally (no network partitions, 
etc.), this check should be fine since there should be only one leader.

However, there might be a small window (within raft.server.rpc.timeout.max, 
which defaults to 300ms) in which two OM or SCM nodes each believe they are 
the leader, before one steps down with LOST_MAJORITY_HEARTBEATS. During this 
period, two services might be running at the same time, and both can update 
the OM / SCM state.

One way to address this is to add a short sleep before starting the service 
and to check the leadership again before each service run. Additionally, we 
should also interrupt the background service if there is a leadership change.

This is only one instance; this can be expanded into a story to review and 
consolidate the consistency guarantees of OM and SCM background services.


> Add a short wait before starting service after Ratis leadership change
> ----------------------------------------------------------------------
>
>                 Key: HDDS-13819
>                 URL: https://issues.apache.org/jira/browse/HDDS-13819
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major


