Re: [PR] HDDS-14834. Fix race condition between DeadNodeHandler and HealthyReadOnlyNodeHandler on NetworkTopology [ozone]

via GitHub Sun, 15 Mar 2026 23:40:04 -0700


Gargi-jais11 commented on code in PR #9926:
URL: https://github.com/apache/ozone/pull/9926#discussion_r2938444012



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DeadNodeHandler.java:
##########
@@ -119,15 +119,24 @@ public void onMessage(final DatanodeDetails datanodeDetails,
         deletedBlockLog.onDatanodeDead(datanodeDetails.getID());
       }
 
-      //move dead datanode out of ClusterNetworkTopology
-      NetworkTopology nt = nodeManager.getClusterNetworkTopologyMap();
-      if (nt.contains(datanodeDetails)) {
-        nt.remove(datanodeDetails);
-        //make sure after DN is removed from topology,
-        //DatanodeDetails instance returned from nodeStateManager has no parent.
-        Preconditions.checkState(
-            nodeManager.getNode(datanodeDetails.getID())
-                .getParent() == null);
+      // Only remove from topology if the node is still DEAD. Between the time
+      // the DEAD_NODE event was fired and now, the node may have been
+      // resurrected (DEAD -> HEALTHY_READONLY) via a heartbeat. Removing a
+      // resurrected node from the topology would leave it reachable but
+      // invisible to the placement policy.
+      NodeStatus currentStatus =
+          nodeManager.getNodeStatus(datanodeDetails);
+      if (currentStatus.getHealth() == HddsProtos.NodeState.DEAD) {

Review Comment:
   @ivandika3 Would it make sense to add an early check at the start of
`onMessage` and return if the node is no longer **DEAD**? In the race where
the node is resurrected before this handler runs, we would still run
`removeContainerReplicas`, `REPLICATION_MANAGER_NOTIFY`,
`deletedBlockLog.onDatanodeDead`, etc., which may not be appropriate for a
resurrected node.
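
   A rough sketch of the early check I have in mind. It uses hypothetical
stand-in types (`EarlyDeadCheckSketch`, `Handler`, a string node ID, a
`statusLookup` function) rather than the real SCM classes, and the step names
are only labels for the work `onMessage` does today, so please read it as an
illustration of the control flow and not as the actual patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for illustration only; these are NOT the real
// Ozone/SCM classes or signatures.
class EarlyDeadCheckSketch {

  enum NodeState { HEALTHY, HEALTHY_READONLY, STALE, DEAD }

  static final class Handler {
    // Assumed hook that returns the node's *current* health state.
    private final Function<String, NodeState> statusLookup;
    // Records which cleanup steps ran, so the effect of the guard is visible.
    final List<String> steps = new ArrayList<>();

    Handler(Function<String, NodeState> statusLookup) {
      this.statusLookup = statusLookup;
    }

    void onMessage(String datanodeId) {
      // Early check: re-read the state when the handler actually runs.
      // If the node was resurrected after the event was fired, skip all
      // dead-node cleanup, not just the topology removal.
      if (statusLookup.apply(datanodeId) != NodeState.DEAD) {
        return;
      }
      steps.add("removeContainerReplicas");
      steps.add("REPLICATION_MANAGER_NOTIFY");
      steps.add("deletedBlockLog.onDatanodeDead");
      steps.add("removeFromTopology");
    }
  }

  public static void main(String[] args) {
    Handler resurrected = new Handler(id -> NodeState.HEALTHY_READONLY);
    resurrected.onMessage("dn-1");
    System.out.println(resurrected.steps); // prints []

    Handler dead = new Handler(id -> NodeState.DEAD);
    dead.onMessage("dn-1");
    System.out.println(dead.steps.size()); // prints 4
  }
}
```

   Note that a check like this only narrows the window: without additional
synchronization the node could still be resurrected after the check passes,
so the status check before the topology removal would probably still be
needed.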



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


