[
https://issues.apache.org/jira/browse/HDDS-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306806#comment-17306806
]
Glen Geng edited comment on HDDS-5015 at 3/23/21, 7:00 AM:
-----------------------------------------------------------
*The root cause here is*:
when localId is not set in the sequenceId table, SCM initializes it to
UniqueId.next(). When 3 SCMs are set up from scratch, each of them individually
sets its localId to its own UniqueId.next(), so the sequenceId
diverges from the very beginning.
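A minimal sketch of that divergence, not the actual SCM code: timeBasedId() is a hypothetical stand-in for UniqueId.next() (assumed here to derive its value from the node's own clock, as the issue description suggests), and a plain Map stands in for the sequenceId table.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalIdDivergenceSketch {
  static final String LOCAL_ID_KEY = "localId";

  // Hypothetical stand-in for UniqueId.next(): an ID derived from the node's own clock.
  static long timeBasedId(long nodeClockMillis) {
    return nodeClockMillis << 16;
  }

  // Initialize localId only when the table has no value yet; otherwise keep the stored one.
  static long initLocalId(Map<String, Long> sequenceIdTable, long nodeClockMillis) {
    return sequenceIdTable.computeIfAbsent(LOCAL_ID_KEY, k -> timeBasedId(nodeClockMillis));
  }

  public static void main(String[] args) {
    // Three empty SCMs starting at slightly different times (made-up clock values).
    long[] clocks = {1616482800000L, 1616482800173L, 1616482800185L};
    for (int i = 0; i < clocks.length; i++) {
      Map<String, Long> table = new HashMap<>();
      System.out.println("scm" + (i + 1) + " localId=" + initLocalId(table, clocks[i]));
    }
  }
}
```

Each node fills its own empty table independently, so the three printed values differ whenever the clocks differ.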
*Short term solution is:*
make the 3 empty SCMs reach an agreement on the localId.
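One way to picture such an agreement, as a sketch only and not the HDDS-5015 patch itself: a single proposed value (assumed here to come from the leader) is written to every node's empty table instead of each node generating its own. The class, method, and Map-based table are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalIdAgreementSketch {
  static final String LOCAL_ID_KEY = "localId";

  // Write one proposed localId to every table that does not have a value yet.
  static void agreeOnLocalId(long proposedLocalId, List<Map<String, Long>> sequenceIdTables) {
    for (Map<String, Long> table : sequenceIdTables) {
      table.putIfAbsent(LOCAL_ID_KEY, proposedLocalId);
    }
  }

  public static void main(String[] args) {
    // Three empty tables, one per SCM.
    List<Map<String, Long>> tables = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      tables.add(new HashMap<>());
    }
    // The proposal is the scm1 value from the issue description, used as sample data.
    agreeOnLocalId(105898712280731336L, tables);
    for (int i = 0; i < tables.size(); i++) {
      System.out.println("scm" + (i + 1) + " localId=" + tables.get(i).get(LOCAL_ID_KEY));
    }
  }
}
```

In a real cluster the value would have to be replicated through the consensus layer rather than written directly; the sketch only shows the invariant that all three tables end up with the same localId.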
*Long term solution is:*
Check HDDS-5016.
During bootstrap, the new SCM always downloads a checkpoint from the leader SCM
and replaces its own scm.db with that of the leader.
*The short term solution is safe:*
upgrade in-memory scm to bypass-ratis scm: not affected.
upgrade in-memory scm to single-node scm: not affected.
upgrade in-memory scm to three-node scm cluster: not supported yet, waits for
the long-term solution.
setup a bypass-ratis scm: not affected.
setup a three-node scm cluster from scratch: fixed by the short term solution.
was (Author: glengeng):
*The root cause here is*:
when localId is not set in the sequenceId table, SCM initializes it to
UniqueId.next(). When 3 SCMs are set up from scratch, each of them individually
sets its localId to its own UniqueId.next(), so the sequenceId
diverges from the very beginning.
*Short term solution is:*
make the 3 SCMs reach an agreement on the localId.
*Long term solution is:*
There will be a short term solution, and the long-term solution will be
HDDS-5016. During bootstrap, a new SCM always downloads a checkpoint from the
leader SCM and replaces its own scm.db with that of the leader.
*The short term solution is safe:*
upgrade in-memory scm to bypass-ratis scm: not affected.
upgrade in-memory scm to single-node scm: not affected.
upgrade in-memory scm to three-node scm cluster: not supported yet.
setup a bypass-ratis scm: not affected.
setup a three-node scm cluster from scratch: fixed by the short term solution.
> SequenceID is not consistent when setup a multi node SCM HA cluster.
> --------------------------------------------------------------------
>
> Key: HDDS-5015
> URL: https://issues.apache.org/jira/browse/HDDS-5015
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: SCM HA
> Reporter: Xu Shao Hong
> Assignee: Glen Geng
> Priority: Major
>
> We set up a three-node SCM HA cluster for testing purposes.
> Using the ozone debug ldb tool, we found that the sequenceIDs are not the same
> across the three SCMs. The cause is localID, which is initialized based on
> each machine's own timestamp.
> The ldb results fetched from scm.db on the 3 SCMs:
> *scm1*
> 17000 END
> 8000 END
> 105898712280731336 END
> *scm2*
> 17000 END
> 8000 END
> 105898723592162080 END
> *scm3*
> 17000 END
> 8000 END
> 105898724336720504 END