siddhantsangwan opened a new pull request, #8934:
URL: https://github.com/apache/ozone/pull/8934
## What changes were proposed in this pull request?
See the Jira for the scenario. The root cause is that when a DN dies, even
if it is in maintenance, it is removed from the network topology in
`DeadNodeHandler`. Later, when that DN is passed as a `used node` to the
placement policy during under-replication handling, the policy cannot
determine which rack the node is on (because the node was removed from the
topology) and throws a runtime exception.
There is a tension in this scenario: on one hand we retain the replicas of a
dead maintenance node, assuming it will come back later; on the other hand,
we remove the node from the topology once it has died.
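To make the failure mode concrete, here is a minimal, simplified model of what happens (illustrative only; the real logic lives in `DeadNodeHandler` and the placement policy, and the class and method names below are made up for the sketch):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the failure: the dead maintenance node is removed from
// the topology, yet is still handed to the placement policy as a used node.
public class TopologyLookupSketch {
    // node id -> rack, standing in for the SCM network topology
    static final Map<String, String> TOPOLOGY = new HashMap<>();

    static String rackOf(String nodeId) {
        return TOPOLOGY.get(nodeId);
    }

    public static void main(String[] args) {
        TOPOLOGY.put("dn1", "/rack1");
        TOPOLOGY.put("dn2", "/rack2");

        // DeadNodeHandler removes the dead maintenance node from the topology...
        TOPOLOGY.remove("dn1");

        // ...but dn1 is still passed as a used node, so the rack lookup fails
        // and the policy throws at runtime.
        if (rackOf("dn1") == null) {
            System.out.println("cannot resolve rack for dn1 -> runtime exception");
        }
    }
}
```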
To solve this, I thought of a few alternatives:
- Don't remove the maintenance node from the topology once it dies.
This would _not_ cause new pipelines to include the dead node, because dead
nodes are filtered out. However, I was not sure what other side effects
keeping the node in the topology might have, and it seemed risky. In fact,
there was a Jira specifically for removing dead nodes from the topology, and
HDFS also removes dead maintenance nodes from its topology.
- What HDFS does: don't pass the dead maintenance node as a used node;
instead, pass it only as an excluded node so that the topology never has to
look up its rack. I tried this, but passing it as an excluded node causes a
problem: there are too many used + excluded nodes.
For example, the test
`testOneDeadMaintenanceNodeAndOneLiveMaintenanceNodeAndOneDecommissionNode`
fails if I pass the dead maintenance node as an excluded node. This is because
the dead node no longer counts as a good node, yet it is still counted as an
excluded node, so there are not enough good nodes:
```
2025-08-11 10:50:49,563 [IPC Server handler 43 on default port 15002] INFO
node.SCMNodeManager (SCMNodeManager.java:updateDatanodeOpState(598)) -
Scheduling a command to update the operationalState persisted on
83d15368-ca5f-41c8-b36f-7312e8313235(192.168.29.85/192.168.29.85) as the
reported value (ENTERING_MAINTENANCE, 0) does not match the value stored in SCM
(IN_MAINTENANCE, 0)
...
...
2025-08-11 10:50:51,604 [EventQueue-DeadNodeForDeadNodeHandler] INFO
node.DeadNodeHandler (DeadNodeHandler.java:onMessage(91)) - A dead datanode is
detected. 83d15368-ca5f-41c8-b36f-7312e8313235(192.168.29.85/192.168.29.85)
...
...
...
[UnderReplicatedProcessor] INFO replication.ReplicationManagerUtil
(ReplicationManagerUtil.java:getTargetDatanodes(107)) - Placement policy was
not able to return 1 nodes for container 1.
org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to
choose. TotalNode = 4 RequiredNode = 1 ExcludedNode = 2 UsedNode = 2
at
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodesInternal(SCMContainerPlacementRackAware.java:126)
at
org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:206)
at
org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:103)
at
org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:455)
at
org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:128)
at
org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:774)
at
org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
at
org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:154)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:114)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:163)
...
...
...
2025-08-11 10:50:57,584 [UnderReplicatedProcessor] ERROR
replication.UnhealthyReplicationProcessor
(UnhealthyReplicationProcessor.java:processAll(125)) - Error processing Health
result of class: class
org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult
for container ContainerInfo{id=#1, state=CLOSED,
stateEnterTime=2025-08-11T05:20:40.926Z,
pipelineID=Pipeline-3f793b02-46fe-471f-b5ad-464e6f10edb4,
owner=omServiceIdDefault}
org.apache.hadoop.hdds.scm.exceptions.SCMException: Placement Policy: class
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware
did not return any nodes. Number of required Nodes 1, Data size Required:
10737418240. Container: ContainerInfo{id=#1, state=CLOSED,
stateEnterTime=2025-08-11T05:20:40.926Z,
pipelineID=Pipeline-3f793b02-46fe-471f-b5ad-464e6f10edb4,
owner=omServiceIdDefault}, Used Nodes
[1a00332b-0a4a-4f4a-b4fc-bf743a02884c(192.168.29.85/192.168.29.85)[ENTERING_MAINTENANCE],
d7426fd0-adcc-445e-9f73-da877af2c078(192.168.29.85/192.168.29.85)[IN_SERVICE]],
Excluded Nodes:
[8b07925d-e1ed-4d6c-8ae3-67032444bff1(192.168.29.85/192.168.29.85)[DECOMMISSIONING],
83d15368-ca5f-41c8-b36f-7312e8313235(192.168.29.85/192.168.29.85)[ENTERING_MAINTENANCE]].
at
org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:113)
at
org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:455)
at
org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:128)
at
org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:774)
at
org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
at
org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:154)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:114)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:163)
```
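The "No enough datanodes" failure above comes down to simple counting. A hedged sketch of the availability check (the actual logic in `SCMContainerPlacementRackAware.chooseDatanodesInternal` is more involved; `hasEnoughNodes` is an illustrative name, not the real method):

```java
// Simplified model of the availability check behind the SCMException above.
public class GoodNodeCountSketch {
    static boolean hasEnoughNodes(int total, int excluded, int used, int required) {
        return total - excluded - used >= required;
    }

    public static void main(String[] args) {
        // Values from the failing test: TotalNode=4, ExcludedNode=2, UsedNode=2,
        // RequiredNode=1. Counting the dead maintenance node as excluded leaves
        // 4 - 2 - 2 = 0 good nodes, so the policy throws.
        System.out.println(hasEnoughNodes(4, 2, 2, 1)); // false

        // Not passing the dead node at all: ExcludedNode=1, UsedNode=2, which
        // leaves 4 - 1 - 2 = 1 good node, satisfying the requirement.
        System.out.println(hasEnoughNodes(4, 1, 2, 1)); // true
    }
}
```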
- Finally, I decided not to pass the dead node in at all. The placement
policy will automatically exclude it since it is not present in the topology,
effectively achieving what HDFS does.
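A minimal sketch of this approach: drop dead nodes when building the used-node list, so the placement policy never has to look them up in the topology. The `Node` record and `healthy` flag are simplified stand-ins, not the real Ozone classes:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hedged sketch of the chosen fix: dead nodes are dropped entirely, so they
// are neither used nor excluded, matching what the placement policy would
// infer from their absence in the topology.
public class UsedNodeFilterSketch {
    record Node(String id, boolean healthy) {}

    static List<Node> usedNodes(List<Node> replicaNodes) {
        return replicaNodes.stream()
            .filter(Node::healthy)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> replicas = List.of(
            new Node("dn1", false),  // dead maintenance node
            new Node("dn2", true));
        System.out.println(usedNodes(replicas)); // only dn2 remains
    }
}
```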
The test `testDeadMaintenanceNodeAndDecommission` reproduces the scenario in
the Jira.
Also, while investigating this, I found a related issue. Since the dead
maintenance node is removed from the topology, checking its containers for
mis-replication will also fail, again because the node is no longer in the
topology. I haven't reproduced this; I inferred it from reading the logic in
`validateContainerPlacement`. HDFS gets around this by using the network
location held in memory on the node object instead of looking it up in the
network topology. I think we could do something similar in Ozone, but that
should be investigated in a separate Jira.
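For illustration, the HDFS-style workaround could look roughly like this: cache the network location on the node object and fall back to it when the topology lookup fails. All names here are illustrative, not the real Ozone or HDFS API:

```java
import java.util.Map;

// Hedged sketch: placement validation prefers the live topology entry, but
// falls back to the location cached on the node object when the node has
// been removed from the topology (e.g. a dead maintenance node).
public class CachedLocationSketch {
    record Node(String id, String cachedLocation) {}

    static String rackOf(Node node, Map<String, String> topology) {
        return topology.getOrDefault(node.id(), node.cachedLocation());
    }

    public static void main(String[] args) {
        Node dn1 = new Node("dn1", "/rack1");
        Map<String, String> topology = Map.of("dn2", "/rack2"); // dn1 already removed
        System.out.println(rackOf(dn1, topology)); // "/rack1" from the cached location
    }
}
```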
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13544
## How was this patch tested?
Mentioned above. I also tried to add different racks for the integration
tests, but that didn't work because all nodes have the same IP address.
Marking this a draft while CI runs in my fork; otherwise it's ready for
review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]