[ https://issues.apache.org/jira/browse/HDDS-12109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Andika updated HDDS-12109:
-------------------------------
    Description: 
We encountered an incident where an administrator restarted an SCM and 
transferred leadership to it immediately, while it was still in safe mode. The 
transfer succeeded even though the target SCM was in safe mode. 
However, the new leader could not serve any requests, so user write requests 
blocked until the new leader SCM left safe mode.
We can add a mechanism that prevents a leadership transfer if the target SCM 
is still in safe mode. 

This can be implemented on either the Ozone or the Ratis side. For Ratis, one 
possible idea is to add another StateMachine API that checks whether a 
follower is ready for a leader transfer. However, I think adding a simple 
check of scmClient#inSafeMode should suffice, but we need to change it so that 
scmClient#inSafeMode is not directed to the leader.
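
The Ozone-side check described above could look roughly like the following 
Java sketch. Everything in it is hypothetical and for illustration only: 
LeaderTransferGuard, canTransferTo, transferLeadership, and the per-node 
safe-mode probe are made-up names, not existing Ozone or Ratis APIs. The probe 
stands in for a variant of scmClient#inSafeMode that is answered by the target 
SCM itself instead of being directed to the leader.

```java
import java.util.Map;
import java.util.function.Predicate;

/** Sketch only: all names here are hypothetical, not Ozone or Ratis API. */
final class LeaderTransferGuard {
  // Hypothetical probe: asks the given SCM node itself (not the current
  // leader) whether it is still in safe mode.
  private final Predicate<String> inSafeModeOnNode;

  LeaderTransferGuard(Predicate<String> inSafeModeOnNode) {
    this.inSafeModeOnNode = inSafeModeOnNode;
  }

  /** Returns true only if the target SCM reports that it left safe mode. */
  boolean canTransferTo(String targetScmId) {
    return !inSafeModeOnNode.test(targetScmId);
  }

  /** Runs the transfer action only when the target is out of safe mode. */
  void transferLeadership(String targetScmId, Runnable doTransfer) {
    if (!canTransferTo(targetScmId)) {
      throw new IllegalStateException(
          "Refusing leader transfer: " + targetScmId
              + " is still in safe mode");
    }
    doTransfer.run();
  }

  public static void main(String[] args) {
    // Stand-in for real per-node safe-mode queries; unknown nodes are
    // conservatively treated as being in safe mode.
    Map<String, Boolean> safeMode = Map.of("scm1", false, "scm2", true);
    LeaderTransferGuard guard =
        new LeaderTransferGuard(id -> safeMode.getOrDefault(id, true));

    guard.transferLeadership("scm1",
        () -> System.out.println("transfer to scm1 started"));
    try {
      guard.transferLeadership("scm2",
          () -> System.out.println("transfer to scm2 started"));
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Rejecting the request up front, rather than waiting for the target to leave 
safe mode, keeps the current leader serving writes and lets the administrator 
retry later. Whether to reject or to wait is a design choice this sketch does 
not settle.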


  was:
We encountered an incident where an administrator restarted an SCM and 
transferred leadership to it immediately, while it was still in safe mode. The 
transfer succeeded even though the target SCM was in safe mode. 
However, the new leader could not serve any requests, so user write requests 
blocked until the new leader SCM left safe mode.
We can add a mechanism that prevents a leadership transfer if the target SCM 
is still in safe mode. 

This can be implemented on either the Ozone or the Ratis side. For Ratis, one 
possible idea is to add another StateMachine API that checks whether a 
follower is ready for a leader transfer. However, I think adding a simple 
check of scmClient#inSafeMode should suffice.



> Transfer leadership should not start until target SCM is out of safe mode
> -------------------------------------------------------------------------
>
>                 Key: HDDS-12109
>                 URL: https://issues.apache.org/jira/browse/HDDS-12109
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ivan Andika
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)