[
https://issues.apache.org/jira/browse/HDDS-11380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878868#comment-17878868
]
Siddhant Sangwan commented on HDDS-11380:
-----------------------------------------
The cluster has 9 DNs, 6 of which were stale when decommission was started:
{code:java}
2024-08-28 09:14:58,311 INFO
[node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler:
Datanode b3e6a332-24d8-4f8d-8e55-e08eccd09dea(a-4.b/10.140.158.7) moved to
stale state. Finalizing its pipelines
[PipelineID=4c91b14f-454c-4eba-b24b-6a9297ba0bb9,
PipelineID=615adb58-4f41-4908-bfdd-dd507a1b816e,
PipelineID=a3ee81db-3315-4062-9264-cd3ddb60a357]
2024-08-28 09:14:58,357 INFO
[node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler:
Datanode 4e197573-3f79-410f-946e-b3104cf97e9a(a-7.b/10.140.136.66) moved to
stale state. Finalizing its pipelines
[PipelineID=615adb58-4f41-4908-bfdd-dd507a1b816e,
PipelineID=a3ee81db-3315-4062-9264-cd3ddb60a357,
PipelineID=ce928f12-32ee-45e5-a6f0-b973760d9c3b]
2024-08-28 09:15:07,312 INFO
[node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler:
Datanode d6d0d2b5-dab9-48ec-8866-54b3d70f8b37(a-2.b/10.140.209.135) moved to
stale state. Finalizing its pipelines
[PipelineID=c416fd8b-2a94-47dd-98f3-8b34027a227f,
PipelineID=e10f0d4b-980c-41b9-be1b-6257b0097bcf,
PipelineID=7ef43e94-5b1d-4cf3-8fb2-700ed80ae457]
2024-08-28 09:15:10,312 INFO
[node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler:
Datanode dd2d9383-a4c8-4aa9-9290-7d5494618a3a(a-9.b/10.140.223.194) moved to
stale state. Finalizing its pipelines
[PipelineID=615adb58-4f41-4908-bfdd-dd507a1b816e,
PipelineID=e307f9dc-cc36-4807-ad7e-01f8eb708489,
PipelineID=a3ee81db-3315-4062-9264-cd3ddb60a357]
2024-08-28 09:15:10,317 INFO
[node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler:
Datanode 20213e45-1d8c-4224-96be-83d2c7f85c00(a-6.b/10.140.62.193) moved to
stale state. Finalizing its pipelines
[PipelineID=c416fd8b-2a94-47dd-98f3-8b34027a227f,
PipelineID=e10f0d4b-980c-41b9-be1b-6257b0097bcf,
PipelineID=b0935319-bdb0-4cd4-8d66-574a39f6966a]
2024-08-28 09:15:10,322 INFO
[node1-EventQueue-StaleNodeForStaleNodeHandler]-org.apache.hadoop.hdds.scm.node.StaleNodeHandler:
Datanode 693753b9-cfb5-4bcc-9863-273cf3d32d05(a-3.b/10.140.234.129) moved to
stale state. Finalizing its pipelines
[PipelineID=c416fd8b-2a94-47dd-98f3-8b34027a227f,
PipelineID=e10f0d4b-980c-41b9-be1b-6257b0097bcf,
PipelineID=56d96e20-627d-45aa-9442-a7909ce68116]
...
...
...
2024-08-28 09:18:49,459 INFO [IPC Server handler 58 on
9860]-org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Force flag =
false. Checking if decommission is possible for dns:
[92783743-a66d-435f-a4fe-0c0cc4439429(a-5.b/10.140.185.1)]
2024-08-28 09:18:49,459 INFO [IPC Server handler 58 on
9860]-org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Insufficient
nodes. Tried to decommission 1 nodes of which 1 nodes were valid. Cluster has 3
IN-SERVICE nodes, 3 of which are required for minimum replication. Failing due
to datanode : 92783743-a66d-435f-a4fe-0c0cc4439429(a-5.b/10.140.185.1),
container : #19001
2024-08-28 09:18:49,460 ERROR [IPC Server handler 58 on
9860]-org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Cannot
decommission nodes as sufficient node are not available.
{code}
Three-way replication requires a minimum of 3 DNs. With 6 of the 9 DNs stale,
the log above counts only 3 IN-SERVICE nodes, which is exactly that minimum,
so another DN can't be decommissioned at this point. This works as expected,
but we should improve the error message shown at the CLI.
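
To make the arithmetic behind the rejection concrete, here is a minimal sketch of the check. It is not the actual NodeDecommissionManager code: the class name, method names, and the wording of the message are hypothetical, and it assumes that stale nodes are simply not counted as available, which is what the numbers in the log suggest.

```java
// Illustrative sketch only; NOT the actual NodeDecommissionManager logic.
// All names and the message wording here are hypothetical.
final class DecommissionCheckSketch {

    /** True if enough nodes would remain after removing the requested ones. */
    static boolean canDecommission(int availableNodes, int toDecommission,
                                   int minReplication) {
        return availableNodes - toDecommission >= minReplication;
    }

    /**
     * Builds a hypothetical, more explicit message that also reports how many
     * nodes are stale. Assumes stale nodes are not counted as available.
     */
    static String describe(int totalNodes, int staleNodes, int toDecommission,
                           int minReplication) {
        int available = totalNodes - staleNodes;
        if (canDecommission(available, toDecommission, minReplication)) {
            return "Decommission can proceed.";
        }
        return String.format(
            "Cannot decommission %d node(s): %d of %d datanodes are stale, "
                + "leaving %d available; %d are required for minimum replication.",
            toDecommission, staleNodes, totalNodes, available, minReplication);
    }

    public static void main(String[] args) {
        // Numbers from this cluster: 9 DNs, 6 stale, 1 to decommission, factor 3.
        System.out.println(describe(9, 6, 1, 3));
    }
}
```

With the values from this cluster (9 DNs, 6 stale, 1 requested, 3 required), 3 - 1 = 2 nodes would remain, which is below the minimum, so the request is rejected. A message along these lines might make the cause clearer at the CLI than the current one does.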
> Decommissioning of DN fails immediately when network topology is enabled
> -------------------------------------------------------------------------
>
> Key: HDDS-11380
> URL: https://issues.apache.org/jira/browse/HDDS-11380
> Project: Apache Ozone
> Issue Type: Bug
> Components: DN
> Reporter: Varsha Ravi
> Assignee: Siddhant Sangwan
> Priority: Major
>
> Decommission of DN fails immediately with the error *Insufficient nodes* when
> network topology is enabled.
> The cluster has 9 DNs spread across 5 racks.
> {noformat}
> Error: AllHosts: Insufficient nodes. Tried to decommission 1 nodes of which 1
> nodes were valid. Cluster has 3 IN-SERVICE nodes, 3 of which are required for
> minimum replication.
> java.io.IOException: Some nodes could not enter the decommission workflow
> at
> org.apache.hadoop.hdds.scm.cli.datanode.DecommissionSubCommand.execute(DecommissionSubCommand.java:80)
> at
> org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:39)
> at
> org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:29)
> at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
> at picocli.CommandLine.access$1500(CommandLine.java:148)
> at
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
> at
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
> at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
> at picocli.CommandLine.execute(CommandLine.java:2174)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
> at
> org.apache.hadoop.hdds.cli.OzoneAdmin.lambda$execute$0(OzoneAdmin.java:80)
> at
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
> at
> org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:159)
> at org.apache.hadoop.hdds.cli.OzoneAdmin.execute(OzoneAdmin.java:79)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
> at
> org.apache.hadoop.hdds.cli.OzoneAdmin.main(OzoneAdmin.java:72){noformat}
> *Topology details:*
> {noformat}
> State = HEALTHY
>
> DN5:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_cu31u
>
> DN1:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_cu31u
>
> DN4:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_cu31u
>
> DN8:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_co159
>
> DN2:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_co159
>
> DN9:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_co159
>
> DN6:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_hhbkg
>
> DN7:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_eyj9h
>
> DN3:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859
> IN_SERVICE /rack_eka3e{noformat}
> DN to be decommissioned: DN5
> This might be due to the improvement done as part of HDDS-10462.