sodonnel opened a new pull request, #5458:
URL: https://github.com/apache/ozone/pull/5458

   ## What changes were proposed in this pull request?
   
   If a datanode is registered with SCM on the cluster, then stopped, its data 
disks cleared, and restarted, it will reconnect to SCM as a new node with a new 
UUID.
   
   When this happens, the old datanode details are retained in SCM as a dead 
node, and the mapping table that maps DNs running on a host to their UUIDs will 
contain two entries. This leaves the decommission command unable to decide 
which entry should be decommissioned, producing this error:
   
   ```
   2023-10-03 08:05:50,279 ERROR [IPC Server handler 25 on 
9860]-org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer: Failed to 
decommission nodes
   org.apache.hadoop.hdds.scm.node.InvalidHostStringException: Host 
host1.acme.org is running multiple datanodes registered with SCM, but no port 
numbers match. Please check the port number.
           at 
org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.mapHostnamesToDatanodes(NodeDecommissionManager.java:151)
           at 
org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.decommissionNodes(NodeDecommissionManager.java:228)
           at 
org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.decommissionNodes(SCMClientProtocolServer.java:624)
           at 
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.decommissionNodes(StorageContainerLocationProtocolServerSideTranslatorPB.java:1114)
           at 
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:602)
           at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
           at 
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:221)
           at 
org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
           at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
           at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
           at java.base/java.security.AccessController.doPrivileged(Native 
Method)
           at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
           at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
           at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
   ```
   
   It is valid for multiple DNs to run on the same host, especially on test 
clusters or mini-clusters. However, two DNs cannot be heartbeating from the 
same host using the same ports.
   
   Therefore, when we try to decommission a host that has multiple registered 
entries and all the ports are identical across those entries, we can safely 
decommission the entry with the most recent heartbeat, as only that one can 
still be alive.
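   The selection rule above can be sketched as follows. This is a minimal 
illustration with hypothetical class and method names, not the actual 
`NodeDecommissionManager` API: given several registrations for the same host 
whose ports all match, pick the one that heartbeated most recently.
   
   ```java
   import java.util.Comparator;
   import java.util.List;
   
   // Hypothetical stand-in for a registered datanode entry in SCM's host
   // mapping table; field names are illustrative only.
   class DatanodeEntry {
       final String uuid;
       final long lastHeartbeatMillis;
   
       DatanodeEntry(String uuid, long lastHeartbeatMillis) {
           this.uuid = uuid;
           this.lastHeartbeatMillis = lastHeartbeatMillis;
       }
   }
   
   class DecommissionTargetPicker {
       // Of the candidate registrations for one host (all with identical
       // ports), return the one with the newest heartbeat -- any older entry
       // must be a stale registration left behind by a reformatted node.
       static DatanodeEntry pickNewest(List<DatanodeEntry> candidates) {
           return candidates.stream()
               .max(Comparator.comparingLong(e -> e.lastHeartbeatMillis))
               .orElseThrow(() ->
                   new IllegalArgumentException("no candidate entries"));
       }
   }
   ```
   
   The stale dead-node entry, having stopped heartbeating when the node was 
wiped, always loses this comparison, so the live registration is the one acted 
upon.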
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-9481
   
   ## How was this patch tested?
   
   New unit test added to reproduce and validate the fix.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

