Pratyush Bhatt created HDDS-9474:
------------------------------------
Summary: [Decommissioning] All OMs fail to start with SCM version info mismatch
Key: HDDS-9474
URL: https://issues.apache.org/jira/browse/HDDS-9474
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Manager, SCM
Reporter: Pratyush Bhatt
*Scenario:* Decommission a leader SCM node, delete the node, and later
recommission the same host as an SCM node.
*Steps:*
1. Transfer the leadership to a follower SCM Node.
2. Decommission the SCM Node.
3. Stop the Decommissioned SCM.
4. Delete all the SCM dirs:
['/var/lib/hadoop-ozone/scm/data', '/var/lib/hadoop-ozone/scm/ratis',
'/var/lib/hadoop-ozone/scm/ozone-metadata']
5. Delete the SCM role.
6. Add the same host back as an SCM node.
7. Restart Ozone.
*Observed behavior:*
Ozone restarted, but SCM remained in safe mode.
{noformat}
2023-10-17 02:55:19,731|INFO|MainThread|machine.py:190 - run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|RUNNING: /opt/cloudera/parcels/CDH/bin/ozone admin safemode status --verbose
2023-10-17 02:55:23,637|INFO|MainThread|machine.py:205 - run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|SCM is in safe mode.
2023-10-17 02:55:23,660|INFO|MainThread|machine.py:205 - run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|validated:false, DataNodeSafeModeRule, registered datanodes (=0) >= required datanodes (=1)
2023-10-17 02:55:23,661|INFO|MainThread|machine.py:205 - run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
2023-10-17 02:55:23,661|INFO|MainThread|machine.py:205 - run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|validated:true, ContainerSafeModeRule, % of containers with at least one reported replica (=1.00) >= safeModeCutoff (=0.99)
2023-10-17 02:55:23,661|INFO|MainThread|machine.py:205 - run()||GUID=c8edf40b-98ae-4412-9dbc-c04d704d442b|validated:true, AtleastOneDatanodeReportedRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
{noformat}
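To make the blocking rule easier to spot, here is a minimal Python sketch that picks out the rules reported as validated:false. It assumes each rule arrives on a single unwrapped line in the "validated:<bool>, <RuleName>, <description>" shape seen above. The function name and the regular expression are illustrative and not part of Ozone.

```python
import re

# One rule line, as seen in the output above:
#   validated:<true|false>, <RuleName>, <description>
# The pattern is an assumption derived from that sample only.
RULE_RE = re.compile(r"validated:(true|false),\s*(\w+),\s*(.*)")


def failing_rules(output: str) -> list:
    """Return the names of safe mode rules reported as validated:false."""
    failing = []
    for line in output.splitlines():
        match = RULE_RE.search(line)
        if match and match.group(1) == "false":
            failing.append(match.group(2))
    return failing


# Rule lines copied from the report, with the log prefix dropped.
sample = "\n".join([
    "validated:false, DataNodeSafeModeRule, registered datanodes (=0) >= required datanodes (=1)",
    "validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)",
    "validated:true, ContainerSafeModeRule, % of containers with at least one reported replica (=1.00) >= safeModeCutoff (=0.99)",
    "validated:true, AtleastOneDatanodeReportedRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)",
])

print(failing_rules(sample))
```

For the output in this report, the only failing rule is DataNodeSafeModeRule: no datanodes have registered, while one is required.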
The OM logs from the same time show an SCM version mismatch error (all 3 OMs
shut down with the same error):
{code:java}
2023-10-17 02:52:16,327 ERROR [main]-org.apache.hadoop.ozone.om.OzoneManager: clusterId from ozn-decom58-1.ozn-decom58.root.hwx.site:9863,ozn-decom58-2.ozn-decom58.root.hwx.site:9863,ozn-decom58-8.ozn-decom58.root.hwx.site:9863 is CID-33e18bed-b7fb-45c6-83e3-b03ce5592930, but is CID-66a9753f-1b4b-4e0e-b529-9163fd254509 in /var/lib/hadoop-ozone/om/data/om/current/VERSION
2023-10-17 02:52:16,329 ERROR [main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
SCM_VERSION_MISMATCH_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: SCM version info mismatch.
    at org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:607)
    at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:752)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:189)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:86)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:74)
    at org.apache.hadoop.hdds.cli.GenericCli.call(GenericCli.java:38)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:58)
2023-10-17 02:52:16,332 INFO [shutdown-hook-0]-org.apache.hadoop.ozone.om.OzoneManagerStarter: SHUTDOWN_MSG:
/************************************************************
{code}
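The first ERROR line above carries both cluster IDs, so they can be extracted and compared when triaging. The following Python sketch does only that. It assumes the message holds exactly two "CID-<uuid>" tokens in the order shown (the ID obtained from SCM first, the ID stored in the OM VERSION file second). The helper name and the pattern are illustrative and are not an Ozone API.

```python
import re

# "CID-" followed by a lowercase UUID, as it appears in the OM log above.
CID_RE = re.compile(r"CID-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def cluster_ids(message: str) -> tuple:
    """Return (id reported by SCM, id stored in the OM VERSION file).

    Assumes the ordering of the log message in this report:
    "clusterId from <scm hosts> is <id>, but is <id> in <VERSION path>".
    """
    ids = CID_RE.findall(message)
    if len(ids) != 2:
        raise ValueError("expected exactly two cluster IDs, found %d" % len(ids))
    return ids[0], ids[1]


# Message text copied from the report, with the SCM host list shortened.
message = (
    "clusterId from ozn-decom58-1.ozn-decom58.root.hwx.site:9863 "
    "is CID-33e18bed-b7fb-45c6-83e3-b03ce5592930, but is "
    "CID-66a9753f-1b4b-4e0e-b529-9163fd254509 in "
    "/var/lib/hadoop-ozone/om/data/om/current/VERSION"
)

from_scm, in_version_file = cluster_ids(message)
print("SCM reports:     ", from_scm)
print("OM VERSION file: ", in_version_file)
print("match:", from_scm == in_version_file)
```

In this report the two IDs differ, which is the condition the OM rejects at startup with SCM_VERSION_MISMATCH_ERROR.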
*Expected behavior:*
Ozone should be up and running after the recommission.