GlenGeng opened a new pull request #1893:
URL: https://github.com/apache/ozone/pull/1893


   ## What changes were proposed in this pull request?
   
   Scope
   The admin commands in scope are ReplicationManager (rm) start/stop and safe 
mode exit.
    
   Requirement
   1, When the admin stops rm, rm must stop in every SCM, and a re-election 
must not restart rm on the new leader.
   2, When the admin starts rm, it takes effect only on an SCM that is the 
leader and out of safe mode. If the leader is in safe mode, rm does not take 
effect even when the admin starts it explicitly.
   3, An admin rm start/stop does not survive the restart of an SCM instance. 
An admin who stops rm for the SCM cluster should watch for SCM crashes, because 
a restarted SCM does not keep the stop.
    
   Status
   1, For now, admin rm start/stop creates/destroys the rm thread.
   2, SCMContainerLocationFailoverProxyProvider is proxied by 
FailoverProxyProvider (fpp), which round-robins over the SCMs in 
ozone.scm.names until the request is handled successfully. On the server side, 
every client request first goes through an isLeader check; a non-leader returns 
a not-leader exception (nle) so that fpp fails over to the next SCM.
   3, SCMService decides whether the next iteration of rm takes effect by 
switching between RUNNING and PAUSING.
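
The failover behaviour described in Status item 2 can be sketched roughly as 
follows. This is an illustration only: `Scm`, `NotLeaderException` and 
`invokeWithFailover` are hypothetical stand-ins written for this example, not 
Ozone or Hadoop classes, and the sketch makes a single pass over the list 
rather than retrying indefinitely.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: these types are stand-ins, not Ozone classes. */
final class RoundRobinFailoverSketch {

  /** Stand-in for the not-leader error a non-leader SCM returns. */
  static final class NotLeaderException extends Exception {
    NotLeaderException(String scmName) {
      super(scmName + " is not the leader");
    }
  }

  /** Stand-in for one SCM listed in ozone.scm.names. */
  static final class Scm {
    final String name;
    final boolean leader;

    Scm(String name, boolean leader) {
      this.name = name;
      this.leader = leader;
    }

    /** Server side: run the leader check before handling the request. */
    String handle(String request) throws NotLeaderException {
      if (!leader) {
        throw new NotLeaderException(name);
      }
      return name + " handled " + request;
    }
  }

  /** Client side: try each SCM in order until one accepts the request. */
  static String invokeWithFailover(List<Scm> scms, String request) {
    for (Scm scm : scms) {
      try {
        return scm.handle(request);
      } catch (NotLeaderException e) {
        // Not the leader: fail over to the next SCM in the list.
      }
    }
    throw new IllegalStateException("no SCM accepted the request");
  }

  public static void main(String[] args) {
    List<Scm> scms = Arrays.asList(
        new Scm("scm1", false), new Scm("scm2", true), new Scm("scm3", false));
    System.out.println(invokeWithFailover(scms, "listContainers"));
  }
}
```

With this shape, a normal request stops at the first SCM that passes the leader 
check, which is why an admin rm start/stop needs different handling to reach 
every SCM.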
    
   Solution:
   When an rm stop/start request arrives on the server side, the SCM skips the 
isLeader check and simply destroys/creates the rm thread. The client side fakes 
an exception so that fpp tries the next SCM in round-robin order.
   The RUNNING/PAUSING status and admin rm start/stop can be treated 
separately: the admin operations and the raft status are requirements along two 
independent dimensions.
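
A minimal sketch of the two dimensions, under stated assumptions: `ScmNode`, 
`rmThreadExists`, `rmTakesEffect` and the `adminStartAll`/`adminStopAll` 
helpers are hypothetical names for this example, not the classes touched by 
this patch. The plain loop over all nodes stands in for the client faking an 
exception so that fpp visits each SCM, and the sketch assumes the rm thread 
exists after startup.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: models the admin flag and the raft-driven status. */
final class AdminRmSketch {

  enum ServiceStatus { RUNNING, PAUSING }

  /** Hypothetical stand-in for the rm-related state of one SCM. */
  static final class ScmNode {
    final String name;
    boolean leader;
    boolean inSafeMode;
    // Admin dimension. Assumption: the rm thread exists after startup.
    // This flag lives in memory only, so it is lost on restart.
    boolean rmThreadExists = true;

    ScmNode(String name, boolean leader, boolean inSafeMode) {
      this.name = name;
      this.leader = leader;
      this.inSafeMode = inSafeMode;
    }

    /** Raft dimension: RUNNING only on a leader that has left safe mode. */
    ServiceStatus status() {
      return (leader && !inSafeMode)
          ? ServiceStatus.RUNNING : ServiceStatus.PAUSING;
    }

    /** Admin start/stop skip the leader check and only touch the thread. */
    void adminStart() { rmThreadExists = true; }
    void adminStop() { rmThreadExists = false; }

    /** rm does work only when both dimensions allow it. */
    boolean rmTakesEffect() {
      return rmThreadExists && status() == ServiceStatus.RUNNING;
    }
  }

  /** Stands in for the client visiting every SCM through fpp. */
  static void adminStopAll(List<ScmNode> scms) {
    for (ScmNode scm : scms) { scm.adminStop(); }
  }

  static void adminStartAll(List<ScmNode> scms) {
    for (ScmNode scm : scms) { scm.adminStart(); }
  }

  static int activeCount(List<ScmNode> scms) {
    int count = 0;
    for (ScmNode scm : scms) {
      if (scm.rmTakesEffect()) { count++; }
    }
    return count;
  }

  public static void main(String[] args) {
    ScmNode scm1 = new ScmNode("scm1", true, false);
    ScmNode scm2 = new ScmNode("scm2", false, true);
    List<ScmNode> scms = Arrays.asList(scm1, scm2);

    adminStopAll(scms);
    scm1.leader = false;   // re-election: scm2 becomes the leader
    scm2.leader = true;
    System.out.println(activeCount(scms)); // 0: stop holds across election

    adminStartAll(scms);
    System.out.println(activeCount(scms)); // 0: new leader is in safe mode

    scm2.inSafeMode = false;
    System.out.println(activeCount(scms)); // 1: only the leader is active
  }
}
```

Keeping the admin flag separate from the status means a re-election changes 
only `status()`, so a stopped rm stays stopped on whichever SCM becomes leader.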
    
   With this, the requirements above are met:
   1, When the admin stops rm, rm must stop in every SCM, and a re-election 
must not restart rm on the new leader.
   Met: admin rm stop destroys the rm thread in all SCMs.
    
   2, When the admin starts rm, it takes effect only on an SCM that is the 
leader and out of safe mode. If the leader is in safe mode, rm does not take 
effect even when the admin starts it explicitly.
   Met: admin rm start creates the rm thread in all SCMs, but the 
RUNNING/PAUSING status is decided by leadership and safe mode.
    
   3, An admin rm start/stop does not survive the restart of an SCM instance. 
An admin who stops rm for the SCM cluster should watch for SCM crashes.
   Met. This is actually a relaxed requirement.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-4740
   
   ## How was this patch tested?
   
   CI and a 3-SCM cluster.




