GlenGeng opened a new pull request #2072:
URL: https://github.com/apache/ozone/pull/2072


   ## What changes were proposed in this pull request?
   
   We set up a three-node SCM HA cluster for testing.
   Using the ozone debug ldb tool, we found that the localIds differ across 
the three SCMs. The cause is that localId is initialized from each machine's 
own timestamp. 
   
   **The root cause here is:**
   When localId is not set in the sequenceId table, SCM initializes it to 
UniqueId.next(). When three SCMs are set up from scratch, each one 
independently sets its localId to its own UniqueId.next(), so the sequenceId 
diverges from the very beginning.
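
The divergence described above can be reproduced with a small, self-contained sketch. This is not Ozone code: `FakeScm`, `initIndependently`, and the clock readings are hypothetical stand-ins, and a plain `LongSupplier` takes the place of UniqueId.next(), assuming only that it returns a value derived from the local machine's clock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

final class LocalIdDivergenceDemo {

  /** Hypothetical stand-in for one SCM and its sequenceId table. */
  static final class FakeScm {
    private final Map<String, Long> sequenceIdTable = new HashMap<>();
    // Stand-in for UniqueId.next(): a value derived from this machine's clock.
    private final LongSupplier localClock;

    FakeScm(LongSupplier localClock) {
      this.localClock = localClock;
    }

    /** Returns localId, initializing it from the local clock if it is unset. */
    long getOrInitLocalId() {
      return sequenceIdTable.computeIfAbsent("localId", k -> localClock.getAsLong());
    }
  }

  /** Starts one empty SCM per clock reading and returns each node's localId. */
  static long[] initIndependently(long... clockReadings) {
    long[] ids = new long[clockReadings.length];
    for (int i = 0; i < clockReadings.length; i++) {
      final long reading = clockReadings[i];
      ids[i] = new FakeScm(() -> reading).getOrInitLocalId();
    }
    return ids;
  }

  public static void main(String[] args) {
    // Three machines whose clocks differ by a few milliseconds (made-up values).
    long[] ids = initIndependently(1_000L, 1_007L, 1_013L);
    for (long id : ids) {
      System.out.println("localId = " + id);
    }
  }
}
```

Because every node consults only its own clock, no two nodes are guaranteed to start from the same localId.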
    
   **The short-term solution is:**
   Make the three empty SCMs agree on a single localId.
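
One way to picture such an agreement is sketched below. It is an illustration under stated assumptions, not the implementation in this pull request: `Replica`, `applyInit`, and `initCluster` are hypothetical names, no Ratis API is used, and replication is simulated by applying the same proposed value to every in-memory replica.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

final class LocalIdAgreementSketch {

  /** Hypothetical stand-in for one SCM replica's sequenceId table. */
  static final class Replica {
    private final Map<String, Long> sequenceIdTable = new HashMap<>();

    /** Applies a replicated "init localId" entry; an already-set value is kept. */
    void applyInit(long proposed) {
      sequenceIdTable.putIfAbsent("localId", proposed);
    }

    long localId() {
      return sequenceIdTable.get("localId");
    }
  }

  /**
   * Samples the leader's clock exactly once, then applies that single value on
   * every replica, so no replica consults its own clock.
   */
  static long[] initCluster(LongSupplier leaderClock, int replicaCount) {
    long proposed = leaderClock.getAsLong();

    List<Replica> replicas = new ArrayList<>();
    for (int i = 0; i < replicaCount; i++) {
      replicas.add(new Replica());
    }
    // Simulated replication: every replica applies the same entry.
    for (Replica replica : replicas) {
      replica.applyInit(proposed);
    }

    long[] ids = new long[replicaCount];
    for (int i = 0; i < replicaCount; i++) {
      ids[i] = replicas.get(i).localId();
    }
    return ids;
  }

  public static void main(String[] args) {
    // Made-up leader clock reading; all three replicas end up with this value.
    long[] ids = initCluster(() -> 1_000L, 3);
    for (long id : ids) {
      System.out.println("localId = " + id);
    }
  }
}
```

The point of the sketch is that the initial value is chosen in one place and then copied, rather than computed separately on each node.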
    
   **The long-term solution is:**
   See HDDS-5016.
   During bootstrap, a new SCM always downloads a checkpoint from the leader 
SCM and replaces its own scm.db with the leader's.
    
   **The short-term solution is safe:**
   Upgrade an in-memory SCM to a bypass-ratis SCM: not affected.
   Upgrade an in-memory SCM to a single-node SCM: not affected.
   Upgrade an in-memory SCM to a three-node SCM cluster: not supported yet; 
waits for the long-term solution.
   Set up a bypass-ratis SCM: not affected.
   Set up a three-node SCM cluster from scratch: fixed by the short-term 
solution.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-5015
   
   ## How was this patch tested?
   
   CI and testing in a real environment.

