[jira] [Commented] (HDDS-11380) Error message when DN decommissioning fails early needs to be more comprehensive

Stephen O'Donnell (Jira) Wed, 04 Sep 2024 02:25:57 -0700


    [ https://issues.apache.org/jira/browse/HDDS-11380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879151#comment-17879151 ]


Stephen O'Donnell commented on HDDS-11380:
------------------------------------------

Something like this makes sense to me:

> Tried to decommission X out of Y IN_SERVICE nodes. Cannot decommission as a 
> minimum of ? IN_SERVICE nodes are required to maintain replication.

If any of the nodes passed in are invalid, a separate message could be printed 
for each one as it is checked, ahead of the summary, e.g.:

> decommission of A is invalid because ...
> decommission of B is invalid because ...
> Tried to decommission X out of Y IN_SERVICE nodes. Cannot decommission as a 
> minimum of ? IN_SERVICE nodes are required to maintain replication.
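
To make the proposed wording concrete, here is a minimal sketch of how such a message could be assembled. Everything in it is an assumption for illustration: the class `DecommissionMessageSketch`, its `build` method, and the parameter names are hypothetical and are not taken from the Ozone code base or from the pull request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Apache Ozone. Names are illustrative only.
final class DecommissionMessageSketch {

  private DecommissionMessageSketch() { }

  /**
   * Builds the message suggested in this comment.
   *
   * @param invalidNodes host -> reason it was rejected, in the order checked
   * @param requested    number of nodes the user asked to decommission
   * @param inService    number of IN_SERVICE nodes in the cluster
   * @param minRequired  minimum IN_SERVICE nodes needed to maintain replication
   */
  static String build(Map<String, String> invalidNodes, int requested,
      int inService, int minRequired) {
    List<String> lines = new ArrayList<>();
    // One line per invalid node, printed before the summary.
    for (Map.Entry<String, String> e : invalidNodes.entrySet()) {
      lines.add("decommission of " + e.getKey()
          + " is invalid because " + e.getValue());
    }
    // Summary line with the counts that explain the failure.
    lines.add(String.format(
        "Tried to decommission %d out of %d IN_SERVICE nodes. "
            + "Cannot decommission as a minimum of %d IN_SERVICE nodes "
            + "are required to maintain replication.",
        requested, inService, minRequired));
    return String.join("\n", lines);
  }
}
```

Passing an insertion-ordered map (for example a `LinkedHashMap`) keeps the per-node lines in the order the nodes were checked. The sketch only formats the text; deciding whether the decommission should be rejected at all is left to the caller.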



> Error message when DN decommissioning fails early needs to be more 
> comprehensive
> --------------------------------------------------------------------------------
>
>                 Key: HDDS-11380
>                 URL: https://issues.apache.org/jira/browse/HDDS-11380
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: DN
>            Reporter: Varsha Ravi
>            Assignee: Varsha Ravi
>            Priority: Minor
>              Labels: pull-request-available
>
> Decommission of DN fails immediately with the error *Insufficient nodes* when 
> network topology is enabled.
> The cluster has 9 DNs spread across 5 racks.
> {noformat}
> Error: AllHosts: Insufficient nodes. Tried to decommission 1 nodes of which 1 
> nodes were valid. Cluster has 3 IN-SERVICE nodes, 3 of which are required for 
> minimum replication. 
> java.io.IOException: Some nodes could not enter the decommission workflow
>       at org.apache.hadoop.hdds.scm.cli.datanode.DecommissionSubCommand.execute(DecommissionSubCommand.java:80)
>       at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:39)
>       at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:29)
>       at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>       at picocli.CommandLine.access$1500(CommandLine.java:148)
>       at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>       at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>       at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>       at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>       at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>       at picocli.CommandLine.execute(CommandLine.java:2174)
>       at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
>       at org.apache.hadoop.hdds.cli.OzoneAdmin.lambda$execute$0(OzoneAdmin.java:80)
>       at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
>       at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:159)
>       at org.apache.hadoop.hdds.cli.OzoneAdmin.execute(OzoneAdmin.java:79)
>       at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
>       at org.apache.hadoop.hdds.cli.OzoneAdmin.main(OzoneAdmin.java:72)
> {noformat}
> *Topology details:*
> {noformat}
> State = HEALTHY
> DN5:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_cu31u
> DN1:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_cu31u
> DN4:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_cu31u
> DN8:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_co159
> DN2:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_co159
> DN9:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_co159
> DN6:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_hhbkg
> DN7:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_eyj9h
> DN3:HTTPS=9883,CLIENT_RPC=19864,REPLICATION=9886,RATIS=9858,RATIS_ADMIN=9857,RATIS_SERVER=9856,STANDALONE=9859    IN_SERVICE    /rack_eka3e
> {noformat}
> DN to be decommissioned: DN5
> This might be due to the improvement made as part of HDDS-10462.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

