[
https://issues.apache.org/jira/browse/HDDS-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-9481:
---------------------------------
Labels: pull-request-available (was: )
> A reformatted datanode node cannot be decommissioned
> ----------------------------------------------------
>
> Key: HDDS-9481
> URL: https://issues.apache.org/jira/browse/HDDS-9481
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Labels: pull-request-available
>
> If a datanode registered with SCM is stopped, its data disks are cleared, and
> it is then restarted, it reconnects to SCM as a new node with a new UUID.
> When this happens, the old datanode details remain in SCM as a dead node, and
> the table mapping DNs running on a host to their UUIDs contains two entries.
> The decommission command cannot decide which entry should be decommissioned
> and fails with this error:
> {code}
> 2023-10-03 08:05:50,279 ERROR [IPC Server handler 25 on 9860]-org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer: Failed to decommission nodes
> org.apache.hadoop.hdds.scm.node.InvalidHostStringException: Host host1.acme.org is running multiple datanodes registered with SCM, but no port numbers match. Please check the port number.
>     at org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.mapHostnamesToDatanodes(NodeDecommissionManager.java:151)
>     at org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.decommissionNodes(NodeDecommissionManager.java:228)
>     at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.decommissionNodes(SCMClientProtocolServer.java:624)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.decommissionNodes(StorageContainerLocationProtocolServerSideTranslatorPB.java:1114)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:602)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:221)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> {code}
> It is valid for multiple DNs to run on the same host, especially on test
> clusters or mini-clusters. However, it is not possible for two DNs to be
> heartbeating from the same host on the same ports.
> Therefore, when we try to decommission a host that has multiple entries with
> identical ports across all entries, we can safely decommission the one with
> the newest heartbeat.
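The tie-break described above can be sketched as follows. This is a minimal illustration using hypothetical types, not the real Ozone DatanodeDetails/NodeDecommissionManager API: among duplicate entries that share a host and ports, pick the entry whose heartbeat is most recent.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a registered datanode entry in SCM.
// The real implementation would use DatanodeDetails plus node status
// from the NodeManager; names here are illustrative only.
class DnEntry {
    final String uuid;
    final long lastHeartbeatMs;

    DnEntry(String uuid, long lastHeartbeatMs) {
        this.uuid = uuid;
        this.lastHeartbeatMs = lastHeartbeatMs;
    }
}

public class DecommissionTieBreak {
    // Given duplicate entries for one host where all ports match,
    // select the entry with the newest heartbeat (the live datanode);
    // the stale dead-node entry is left alone.
    static DnEntry pickNewest(List<DnEntry> duplicates) {
        return duplicates.stream()
            .max(Comparator.comparingLong(d -> d.lastHeartbeatMs))
            .orElseThrow();
    }

    public static void main(String[] args) {
        DnEntry stale = new DnEntry("old-uuid", 1_000L);  // dead, reformatted node
        DnEntry live = new DnEntry("new-uuid", 9_000L);   // currently heartbeating
        System.out.println(pickNewest(List.of(stale, live)).uuid);
    }
}
```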
--
This message was sent by Atlassian Jira
(v8.20.10#820010)