siddhantsangwan opened a new pull request, #8934:
URL: https://github.com/apache/ozone/pull/8934
## What changes were proposed in this pull request?
See the Jira for the scenario. The root cause is that when a DN dies, even
if it is in maintenance, it is removed from the network topology in
`DeadNodeHandler`. Later, when that DN is passed as a `used node` to the
placement policy during under-replication handling, the policy cannot
determine which rack the node is on (because the node was removed from the
topology) and throws a runtime exception.
There is a tension in this scenario: on one hand we retain the replicas of a
dead maintenance node, assuming it will come back later; on the other hand,
we remove the node from the topology once it has died.
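To make the failure mode concrete, here is a minimal, simplified model of what happens (illustrative only; the real logic lives in `DeadNodeHandler` and the placement policy, and the class and method names below are made up for the sketch):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the failure: the dead maintenance node is removed from
// the topology, yet is still handed to the placement policy as a used node.
public class TopologyLookupSketch {
    // node id -> rack, standing in for the SCM network topology
    static final Map<String, String> TOPOLOGY = new HashMap<>();

    static String rackOf(String nodeId) {
        return TOPOLOGY.get(nodeId);
    }

    public static void main(String[] args) {
        TOPOLOGY.put("dn1", "/rack1");
        TOPOLOGY.put("dn2", "/rack2");

        // DeadNodeHandler removes the dead maintenance node from the topology...
        TOPOLOGY.remove("dn1");

        // ...but dn1 is still passed as a used node, so the rack lookup fails
        // and the policy throws at runtime.
        if (rackOf("dn1") == null) {
            System.out.println("cannot resolve rack for dn1 -> runtime exception");
        }
    }
}
```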
To solve this, I thought of a few alternatives:
- Don't remove the maintenance node from the topology once it dies.
This would _not_ cause new pipelines to include the dead node, because dead
nodes are filtered out. However, I was not sure what other side effects
keeping the node in the topology might have, and it seemed risky. In fact,
there was a Jira specifically for removing dead nodes from the topology, and
HDFS also removes dead maintenance nodes from its topology.
- What HDFS does: don't pass the dead maintenance node as a used node;
instead, pass it only as an excluded node so that the topology never has to
look up its rack. I tried this, but passing it as an excluded node causes a
problem: there are too many used + excluded nodes.
For example, the test
`testOneDeadMaintenanceNodeAndOneLiveMaintenanceNodeAndOneDecommissionNode`
fails if I pass the dead maintenance node as an excluded node. This is because
the dead node no longer counts as a good node, yet it is still counted as an
excluded node, so there are not enough good nodes:
```
2025-08-11 10:50:49,563 [IPC Server handler 43 on default port 15002] INFO
node.SCMNodeManager (SCMNodeManager.java:updateDatanodeOpState(598)) -
Scheduling a command to update the operationalState persisted on
83d15368-ca5f-41c8-b36f-7312e8313235(192.168.29.85/192.168.29.85) as the
reported value (ENTERING_MAINTENANCE, 0) does not match the value stored in SCM
(IN_MAINTENANCE, 0)
...
...
2025-08-11 10:50:51,604 [EventQueue-DeadNodeForDeadNodeHandler] INFO
node.DeadNodeHandler (DeadNodeHandler.java:onMessage(91)) - A dead datanode is
detected. 83d15368-ca5f-41c8-b36f-7312e8313235(192.168.29.85/192.168.29.85)
...
...
...
[UnderReplicatedProcessor] INFO replication.ReplicationManagerUtil
(ReplicationManagerUtil.java:getTargetDatanodes(107)) - Placement policy was
not able to return 1 nodes for container 1.
org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to
choose. TotalNode = 4 RequiredNode = 1 ExcludedNode = 2 UsedNode = 2
at
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodesInternal(SCMContainerPlacementRackAware.java:126)
at
org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:206)
at
org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:103)
at
org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:455)
at
org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:128)
at
org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:774)
at
org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
at
org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:154)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:114)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:163)
...
...
...
2025-08-11 10:50:57,584 [UnderReplicatedProcessor] ERROR
replication.UnhealthyReplicationProcessor
(UnhealthyReplicationProcessor.java:processAll(125)) - Error processing Health
result of class: class
org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult
for container ContainerInfo{id=#1, state=CLOSED,
stateEnterTime=2025-08-11T05:20:40.926Z,
pipelineID=Pipeline-3f793b02-46fe-471f-b5ad-464e6f10edb4,
owner=omServiceIdDefault}
org.apache.hadoop.hdds.scm.exceptions.SCMException: Placement Policy: class
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware
did not return any nodes. Number of required Nodes 1, Data size Required:
10737418240. Container: ContainerInfo{id=#1, state=CLOSED,
stateEnterTime=2025-08-11T05:20:40.926Z,
pipelineID=Pipeline-3f793b02-46fe-471f-b5ad-464e6f10edb4,
owner=omServiceIdDefault}, Used Nodes
[1a00332b-0a4a-4f4a-b4fc-bf743a02884c(192.168.29.85/192.168.29.85)[ENTERING_MAINTENANCE],
d7426fd0-adcc-445e-9f73-da877af2c078(192.168.29.85/192.168.29.85)[IN_SERVICE]],
Excluded Nodes:
[8b07925d-e1ed-4d6c-8ae3-67032444bff1(192.168.29.85/192.168.29.85)[DECOMMISSIONING],
83d15368-ca5f-41c8-b36f-7312e8313235(192.168.29.85/192.168.29.85)[ENTERING_MAINTENANCE]].
at
org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:113)
at
org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:455)
at
org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:128)
at
org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:774)
at
org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
at
org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:154)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:114)
at
org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:163)
```
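The "No enough datanodes" failure above comes down to simple counting. A hedged sketch of the availability check (the actual logic in `SCMContainerPlacementRackAware.chooseDatanodesInternal` is more involved; `hasEnoughNodes` is an illustrative name, not the real method):

```java
// Simplified model of the availability check behind the SCMException above.
public class GoodNodeCountSketch {
    static boolean hasEnoughNodes(int total, int excluded, int used, int required) {
        return total - excluded - used >= required;
    }

    public static void main(String[] args) {
        // Values from the failing test: TotalNode=4, ExcludedNode=2, UsedNode=2,
        // RequiredNode=1. Counting the dead maintenance node as excluded leaves
        // 4 - 2 - 2 = 0 good nodes, so the policy throws.
        System.out.println(hasEnoughNodes(4, 2, 2, 1)); // false

        // Not passing the dead node at all: ExcludedNode=1, UsedNode=2, which
        // leaves 4 - 1 - 2 = 1 good node, satisfying the requirement.
        System.out.println(hasEnoughNodes(4, 1, 2, 1)); // true
    }
}
```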
- Finally, I decided not to pass the dead node in at all. The placement
policy will automatically exclude it since it is not present in the topology,
effectively achieving what HDFS does.
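A minimal sketch of this approach: drop dead nodes when building the used-node list, so the placement policy never has to look them up in the topology. The `Node` record and `healthy` flag are simplified stand-ins, not the real Ozone classes:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hedged sketch of the chosen fix: dead nodes are dropped entirely, so they
// are neither used nor excluded, matching what the placement policy would
// infer from their absence in the topology.
public class UsedNodeFilterSketch {
    record Node(String id, boolean healthy) {}

    static List<Node> usedNodes(List<Node> replicaNodes) {
        return replicaNodes.stream()
            .filter(Node::healthy)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> replicas = List.of(
            new Node("dn1", false),  // dead maintenance node
            new Node("dn2", true));
        System.out.println(usedNodes(replicas)); // only dn2 remains
    }
}
```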
The test `testDeadMaintenanceNodeAndDecommission` reproduces the scenario in
the Jira.
Also, while investigating this, I found a related issue. Since the dead
maintenance node is removed from the topology, checking its containers for
mis-replication will also fail, again because the node is no longer in the
topology. I haven't reproduced this; I inferred it from reading the logic in
`validateContainerPlacement`. HDFS gets around this by using the network
location held in memory on the node object instead of looking it up in the
network topology. I think we could do something similar in Ozone, but that
should be investigated in a separate Jira.
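For illustration, the HDFS-style workaround could look roughly like this: cache the network location on the node object and fall back to it when the topology lookup fails. All names here are illustrative, not the real Ozone or HDFS API:

```java
import java.util.Map;

// Hedged sketch: placement validation prefers the live topology entry, but
// falls back to the location cached on the node object when the node has
// been removed from the topology (e.g. a dead maintenance node).
public class CachedLocationSketch {
    record Node(String id, String cachedLocation) {}

    static String rackOf(Node node, Map<String, String> topology) {
        return topology.getOrDefault(node.id(), node.cachedLocation());
    }

    public static void main(String[] args) {
        Node dn1 = new Node("dn1", "/rack1");
        Map<String, String> topology = Map.of("dn2", "/rack2"); // dn1 already removed
        System.out.println(rackOf(dn1, topology)); // "/rack1" from the cached location
    }
}
```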
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13544
## How was this patch tested?
Mentioned above. I also tried to add different racks for the integration
tests, but that didn't work because all nodes have the same IP address.
Marking this a draft while CI runs in my fork; otherwise it's ready for
review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]