sodonnel opened a new pull request, #5458:
URL: https://github.com/apache/ozone/pull/5458
## What changes were proposed in this pull request?
If a datanode registered with SCM is stopped, its data disks are cleared, and it is then restarted, it reconnects to SCM as a new node with a new UUID.
When this happens, the old datanode details are retained in SCM as a dead node, and the table mapping datanodes on a host to their UUIDs ends up with two entries. The decommission command is then unable to decide which entry should be decommissioned and fails with this error:
```
2023-10-03 08:05:50,279 ERROR [IPC Server handler 25 on 9860]-org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer: Failed to decommission nodes
org.apache.hadoop.hdds.scm.node.InvalidHostStringException: Host host1.acme.org is running multiple datanodes registered with SCM, but no port numbers match. Please check the port number.
	at org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.mapHostnamesToDatanodes(NodeDecommissionManager.java:151)
	at org.apache.hadoop.hdds.scm.node.NodeDecommissionManager.decommissionNodes(NodeDecommissionManager.java:228)
	at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.decommissionNodes(SCMClientProtocolServer.java:624)
	at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.decommissionNodes(StorageContainerLocationProtocolServerSideTranslatorPB.java:1114)
	at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:602)
	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
	at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:221)
	at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
```
It is valid for multiple DNs to run on the same host, especially on test clusters or mini-clusters. However, two DNs cannot both be heartbeating from the same host on the same ports.
Therefore, when we try to decommission a host that has multiple registered entries and the ports are identical across all of them, we can safely decommission the entry with the most recent heartbeat, since the older entries must be stale.
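The selection rule described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual patch: the class, record, and method names (`DecommissionPick`, `DnEntry`, `pickNewest`) are invented for this example, while the real change lives in `NodeDecommissionManager`.

```java
import java.util.Comparator;
import java.util.List;

public class DecommissionPick {

    // Minimal stand-in for an SCM datanode record: its UUID and the
    // time of its last heartbeat. Names here are illustrative only.
    record DnEntry(String uuid, long lastHeartbeatMillis) {}

    // Among multiple entries for the same host whose ports all match,
    // the entry with the newest heartbeat is the live datanode; the
    // older entries are stale duplicates left behind by re-registration.
    static DnEntry pickNewest(List<DnEntry> entries) {
        return entries.stream()
            .max(Comparator.comparingLong(DnEntry::lastHeartbeatMillis))
            .orElseThrow(() -> new IllegalArgumentException("no entries"));
    }

    public static void main(String[] args) {
        // Two entries for the same host/ports: the old (dead) UUID and
        // the new UUID created after the disks were wiped.
        List<DnEntry> sameHostSamePorts = List.of(
            new DnEntry("old-uuid", 1_696_300_000_000L),
            new DnEntry("new-uuid", 1_696_320_000_000L));
        // Prints "new-uuid": the stale entry is skipped.
        System.out.println(pickNewest(sameHostSamePorts).uuid());
    }
}
```

Note this rule only applies when every candidate entry reports identical ports; if the ports differ, the existing port-matching logic still disambiguates.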
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9481
## How was this patch tested?
New unit test added to reproduce and validate the fix.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]