priyeshkaratha commented on code in PR #9926:
URL: https://github.com/apache/ozone/pull/9926#discussion_r2944254112
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/HealthyReadOnlyNodeHandler.java:
##########
@@ -96,15 +96,14 @@ public void onMessage(DatanodeDetails datanodeDetails,
}
}
- //add node back if it is not present in networkTopology
+ // Always ensure the node is in the topology. Using unconditional add
+ // rather than a contains-then-add check to avoid a race with
+ // DeadNodeHandler, which may remove the node between the check and
+ // the add. InnerNodeImpl.add() is idempotent for existing nodes.
NetworkTopology nt = nodeManager.getClusterNetworkTopologyMap();
- if (!nt.contains(datanodeDetails)) {
- nt.add(datanodeDetails);
- // make sure after DN is added back into topology, DatanodeDetails
- // instance returned from nodeStateManager has parent correctly set.
- Objects.requireNonNull(
- nodeManager.getNode(datanodeDetails.getID())
- .getParent(), "Parent == null");
- }
+ nt.add(datanodeDetails);
+ Objects.requireNonNull(
Review Comment:
The call nodeManager.getNode(datanodeDetails.getID()) could return null if
the node is concurrently removed from the NodeManager. This would cause a
NullPointerException when .getParent() is called.
It would be better to handle it as shown below:
```
DatanodeDetails node = nodeManager.getNode(datanodeDetails.getID());
Objects.requireNonNull(node,
    "Node not found in NodeManager after adding to topology");
Objects.requireNonNull(node.getParent(), "Parent == null");
```
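As a minimal, self-contained sketch of why the two separate checks help: each failure mode then surfaces with its own message instead of an anonymous NullPointerException. The `Node` class and `check` helper here are hypothetical stand-ins for illustration only, not the Ozone `DatanodeDetails` or `NodeManager` types.

```java
import java.util.Objects;

class RequireNonNullMessages {
  /** Hypothetical stand-in for DatanodeDetails; only carries a parent. */
  static final class Node {
    private final Object parent;

    Node(Object parent) {
      this.parent = parent;
    }

    Object getParent() {
      return parent;
    }
  }

  /**
   * Runs the two checks in order and reports which one failed.
   * Returns "ok" when the node exists and has a parent.
   */
  static String check(Node node) {
    try {
      // First check: the lookup itself may have returned null.
      Objects.requireNonNull(node, "Node not found in NodeManager");
      // Second check: the node exists but is not attached to a parent.
      Objects.requireNonNull(node.getParent(), "Parent == null");
      return "ok";
    } catch (NullPointerException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(check(new Node(new Object())));
    System.out.println(check(new Node(null)));
    System.out.println(check(null));
  }
}
```

Note that both checks still throw; this only makes the cause visible in the stack trace. Whether throwing is acceptable inside the event handler is a separate decision.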
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DeadNodeHandler.java:
##########
@@ -119,15 +119,24 @@ public void onMessage(final DatanodeDetails datanodeDetails,
deletedBlockLog.onDatanodeDead(datanodeDetails.getID());
}
- //move dead datanode out of ClusterNetworkTopology
- NetworkTopology nt = nodeManager.getClusterNetworkTopologyMap();
- if (nt.contains(datanodeDetails)) {
- nt.remove(datanodeDetails);
- //make sure after DN is removed from topology,
- //DatanodeDetails instance returned from nodeStateManager has no parent.
- Preconditions.checkState(
- nodeManager.getNode(datanodeDetails.getID())
- .getParent() == null);
+ // Only remove from topology if the node is still DEAD. Between the time
+ // the DEAD_NODE event was fired and now, the node may have been
+ // resurrected (DEAD -> HEALTHY_READONLY) via a heartbeat. Removing a
+ // resurrected node from the topology would leave it reachable but
+ // invisible to the placement policy.
+ NodeStatus currentStatus =
+ nodeManager.getNodeStatus(datanodeDetails);
+ if (currentStatus.getHealth() == HddsProtos.NodeState.DEAD) {
+ NetworkTopology nt = nodeManager.getClusterNetworkTopologyMap();
+ if (nt.contains(datanodeDetails)) {
+ nt.remove(datanodeDetails);
+ Preconditions.checkState(
Review Comment:
The call to nodeManager.getNode(datanodeDetails.getID()) could return null
if the node is concurrently removed from the NodeManager while this handler is
executing. This would lead to a NullPointerException when .getParent() is
called, which could terminate the event handler thread.
It would be better to handle it as shown below:
```
DatanodeDetails node = nodeManager.getNode(datanodeDetails.getID());
if (node != null) {
Preconditions.checkState(node.getParent() == null);
}
```
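The null-guarded variant above can be sketched in isolation as follows. This is only an illustration under assumptions: `REGISTRY` is a hypothetical stand-in for the NodeManager lookup and `Node` for `DatanodeDetails`; neither is the real Ozone API. The point is that the lookup happens once, and a node that has already been removed is treated as "nothing left to verify" rather than dereferenced.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NullSafeParentCheck {
  /** Hypothetical stand-in for DatanodeDetails; only tracks a parent. */
  static final class Node {
    private volatile Object parent;

    Node(Object parent) {
      this.parent = parent;
    }

    Object getParent() {
      return parent;
    }
  }

  /** Hypothetical stand-in for NodeManager lookups, keyed by node ID. */
  static final Map<String, Node> REGISTRY = new ConcurrentHashMap<>();

  /**
   * Returns true if the node is no longer registered or has no parent.
   * The node is looked up exactly once, so a concurrent removal between
   * the lookup and the parent check cannot cause a NullPointerException.
   */
  static boolean detachedOrGone(String id) {
    Node node = REGISTRY.get(id); // may be null if removed concurrently
    return node == null || node.getParent() == null;
  }

  public static void main(String[] args) {
    REGISTRY.put("dn1", new Node(null));         // registered, no parent
    REGISTRY.put("dn2", new Node(new Object())); // registered, has parent
    System.out.println(detachedOrGone("dn1"));
    System.out.println(detachedOrGone("dn2"));
    System.out.println(detachedOrGone("missing"));
  }
}
```

In the handler, the result of such a check would feed `Preconditions.checkState(...)`, so only a node that is still registered and still has a parent trips the precondition.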
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]