Stephen O'Donnell created HDDS-9481:
---------------------------------------

             Summary: A reformatted datanode cannot be decommissioned
                 Key: HDDS-9481
                 URL: https://issues.apache.org/jira/browse/HDDS-9481
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


If a datanode registered with SCM is stopped, its data disks are cleared, and 
it is then restarted, it will reconnect to SCM as a new node with a new UUID.

When this happens, SCM retains the old datanode's details as a dead node, and 
the mapping table that maps DNs running on a host to their UUIDs ends up with 
two entries for the same host. The decommission command cannot decide which 
entry to decommission and fails with this error:

{code}
2023-10-03 08:05:50,279 ERROR [IPC Server handler 25 on 
9860]-org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer: Failed to 
decommission nodes
org.apache.hadoop.hdds.scm.node.InvalidHostStringException: Host host1.acme.org 
is running multiple datanodes registered with SCM, but no port numbers match. 
Please check the port number.
        at 
org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.mapHostnamesToDatanodes(NodeDecommissionManager.java:151)
        at 
org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.decommissionNodes(NodeDecommissionManager.java:228)
        at 
org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.decommissionNodes(SCMClientProtocolServer.java:624)
        at 
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.decommissionNodes(StorageContainerLocationProtocolServerSideTranslatorPB.java:1114)
        at 
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:602)
        at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at 
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:221)
        at 
org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}

It is valid for multiple DNs to run on the same host, especially on test 
clusters or mini-clusters. However, it is not possible for two DNs to be 
heartbeating from the same host using the same ports.

Therefore, when we try to decommission a host that has multiple registered 
entries, and the ports are identical across all of those entries, we can 
safely decommission the one with the most recent heartbeat.
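The selection rule above could be sketched as follows. This is an illustrative sketch only, not Ozone's actual NodeDecommissionManager code: the {{DatanodeEntry}} record and its {{lastHeartbeat}} field are hypothetical stand-ins for SCM's real datanode structures.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: DatanodeEntry is a hypothetical stand-in for SCM's
// real per-datanode record; lastHeartbeat is an epoch-millis timestamp.
public class DecommissionTargetPicker {

    record DatanodeEntry(String uuid, int port, long lastHeartbeat) {}

    // Given multiple registrations for one host, pick a single target only
    // when every entry reports the same port; otherwise the caller must
    // disambiguate by port, as the existing error message requires.
    static DatanodeEntry pickTarget(List<DatanodeEntry> entries) {
        boolean samePorts = entries.stream()
                .map(DatanodeEntry::port).distinct().count() == 1;
        if (!samePorts) {
            throw new IllegalArgumentException(
                "Host runs multiple datanodes with differing ports; "
                + "please specify a port number");
        }
        // The entry with the newest heartbeat is the live (reformatted)
        // datanode; the stale entry belongs to the old, dead registration.
        return entries.stream()
                .max(Comparator.comparingLong(DatanodeEntry::lastHeartbeat))
                .orElseThrow();
    }

    public static void main(String[] args) {
        DatanodeEntry dead = new DatanodeEntry("uuid-old", 9856, 1_000L);
        DatanodeEntry live = new DatanodeEntry("uuid-new", 9856, 2_000L);
        // Both entries share host and port, so the newest heartbeat wins.
        System.out.println(pickTarget(List.of(dead, live)).uuid());
    }
}
```

With differing ports the sketch refuses to guess, mirroring the current InvalidHostStringException behaviour; only the identical-port case is resolved automatically.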



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
