[jira] [Commented] (HDDS-9474) [Decommissioning] All OMs fail to start with SCM version info mismatch

Nandakumar (Jira) Tue, 17 Oct 2023 00:55:05 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-9474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776051#comment-17776051
 ]


Nandakumar commented on HDDS-9474:
----------------------------------

This happens if we decommission the primordial SCM without changing 
_ozone.scm.primordial.node.id_, then add the same node back and perform 
_init_ on it.

We should always perform _bootstrap_ on a newly added node. Performing _init_ 
creates a new ClusterID, so the node is treated as a separate cluster rather 
than as part of the existing Ozone cluster.
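
As a minimal sketch of that rule, choosing the one-time setup command for an 
SCM node might look like the following. The helper name `scm_setup_command` 
and its parameter are hypothetical; only the `ozone scm --init` and 
`ozone scm --bootstrap` options come from the Ozone CLI.

```python
from typing import List


def scm_setup_command(joining_existing_cluster: bool) -> List[str]:
    """Return the one-time setup command for an SCM node.

    Hypothetical helper, for illustration only.
    --init creates a new ClusterID, so it only fits the first SCM of a
    brand-new cluster. A node added to an existing cluster (including a
    recommissioned host) should use --bootstrap instead.
    """
    action = "--bootstrap" if joining_existing_cluster else "--init"
    return ["ozone", "scm", action]
```

For the recommissioned host in this issue, the cluster already exists, so 
the sketch would pick `ozone scm --bootstrap`.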

The decommissioning documentation is fixed by HDDS-9409.
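
To spot this situation before restarting the OMs, we could compare the 
ClusterID stored in the OM VERSION file with the one the SCMs report. A rough 
sketch, assuming the VERSION file is a Java-properties-style text file with a 
`clusterID` key; the function names are hypothetical and not part of Ozone:

```python
from typing import Dict


def parse_version(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blank lines and '#' comments.

    Assumption: the VERSION file uses a simple properties layout.
    """
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def cluster_id_matches(version_text: str, scm_cluster_id: str) -> bool:
    """True if the stored clusterID equals the ClusterID reported by SCM.

    Assumption: the key is named 'clusterID'.
    """
    return parse_version(version_text).get("clusterID") == scm_cluster_id
```

With the values from the OM log in this issue, the VERSION file holds 
CID-66a9753f-1b4b-4e0e-b529-9163fd254509 while the SCMs report 
CID-33e18bed-b7fb-45c6-83e3-b03ce5592930, so `cluster_id_matches` would 
return False.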

> [Decommissioning] All OMs fail to start with SCM version info mismatch
> ----------------------------------------------------------------------
>
>                 Key: HDDS-9474
>                 URL: https://issues.apache.org/jira/browse/HDDS-9474
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Manager, SCM
>            Reporter: Pratyush Bhatt
>            Assignee: Nandakumar
>            Priority: Major
>
> *Scenario:* Decommission a leader SCM node, delete the node, and later 
> recommission the same host as an SCM node.
> *Steps:*
> 1. Transfer the leadership to a follower SCM Node.
> 2. Decommission the SCM Node.
> 3. Stop the Decommissioned SCM.
> 4. Delete all the SCM dirs: 
> ['/var/lib/hadoop-ozone/scm/data', '/var/lib/hadoop-ozone/scm/ratis', 
> '/var/lib/hadoop-ozone/scm/ozone-metadata']
> 5. Delete the SCM role.
> 6. Add the same host back as an SCM node.
> 7. Restart Ozone.
> *Observed behavior:*
> Ozone was able to restart, but SCM is still in safe mode.
> {noformat}
> 2023-10-17 02:55:19,731|INFO|MainThread|machine.py:190 - 
> run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|RUNNING: ozone admin 
> safemode  status --verbose
> 2023-10-17 02:55:23,637|INFO|MainThread|machine.py:205 - 
> run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|SCM is in safe mode.
> 2023-10-17 02:55:23,660|INFO|MainThread|machine.py:205 - 
> run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|validated:false, 
> DataNodeSafeModeRule, registered datanodes (=0) >= required datanodes (=1)
> 2023-10-17 02:55:23,661|INFO|MainThread|machine.py:205 - 
> run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|validated:true, 
> HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= 
> healthyPipelineThresholdCount (=0)
> 2023-10-17 02:55:23,661|INFO|MainThread|machine.py:205 - 
> run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|validated:true, 
> ContainerSafeModeRule, % of containers with at least one reported replica 
> (=1.00) >= safeModeCutoff (=0.99)
> 2023-10-17 02:55:23,661|INFO|MainThread|machine.py:205 - 
> run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|validated:true, 
> AtleastOneDatanodeReportedRule, reported Ratis/THREE pipelines with at least 
> one datanode (=0) >= threshold (=0){noformat}
> The OM logs from the same period show an SCM version mismatch error (all 3 
> OMs shut down with the same error):
> {code:java}
> 2023-10-17 02:52:16,327 ERROR [main]-org.apache.hadoop.ozone.om.OzoneManager: 
> clusterId from 
> ozn-decom58-1.ozn-decom58.root.hwx.site:9863,ozn-decom58-2.ozn-decom58.root.hwx.site:9863,ozn-decom58-8.ozn-decom58.root.hwx.site:9863
>  is CID-33e18bed-b7fb-45c6-83e3-b03ce5592930, but is 
> CID-66a9753f-1b4b-4e0e-b529-9163fd254509 in 
> /var/lib/hadoop-ozone/om/data/om/current/VERSION
> 2023-10-17 02:52:16,329 ERROR 
> [main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with 
> exception
> SCM_VERSION_MISMATCH_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: 
> SCM version info mismatch.
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:607)
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:752)
>         at 
> org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:189)
>         at 
> org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:86)
>         at 
> org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:74)
>         at org.apache.hadoop.hdds.cli.GenericCli.call(GenericCli.java:38)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>         at picocli.CommandLine.access$1300(CommandLine.java:145)
>         at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
>         at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>         at picocli.CommandLine.execute(CommandLine.java:2078)
>         at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
>         at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
>         at 
> org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:58)
> 2023-10-17 02:52:16,332 INFO 
> [shutdown-hook-0]-org.apache.hadoop.ozone.om.OzoneManagerStarter: 
> SHUTDOWN_MSG:
> /************************************************************ {code}
> *Expected behavior:*
> Ozone should be up and running after the recommission.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

