sodonnel commented on code in PR #3329:
URL: https://github.com/apache/ozone/pull/3329#discussion_r861612693
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeInfo.java:
##########
@@ -49,6 +57,7 @@ public class DatanodeInfo extends DatanodeDetails {
private List<StorageReportProto> storageReports;
private List<MetadataStorageReportProto> metadataStorageReports;
private LayoutVersionProto lastKnownLayoutVersion;
Review Comment:
NodeStateManager.checkNodesHealth is what notices the missed heartbeats and
triggers the corresponding events.
The DeadNodeHandler is triggered when the node goes dead (there is also a
StaleNodeHandler), and it clears out the node's pipelines etc. Perhaps we should
reset the command counts when this happens, or perhaps it is valid to leave them
at the last known value. The DatanodeInfo object is not removed AFAIK, as it
holds the DN service state (in_service, decommissioning, healthy, stale, dead
etc). If the DN comes back, the counts will be reset by the heartbeat
processing. If it never comes back, the DatanodeDetails and DatanodeInfo stick
around in SCM until SCM is restarted.
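To make the two options concrete, here is a rough standalone sketch, not the
actual SCM code. CommandCountTracker, onHeartbeat, onDead and the resetCounts
flag are all hypothetical names made up for illustration; the real handlers and
DatanodeInfo have different signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-datanode command counts; not the real DatanodeInfo.
final class CommandCountTracker {
  private final Map<String, Integer> commandCounts = new HashMap<>();
  private boolean dead;

  // Heartbeat processing overwrites whatever was stored before,
  // so a returning DN always ends up with fresh counts.
  synchronized void onHeartbeat(Map<String, Integer> reportedCounts) {
    dead = false;
    commandCounts.clear();
    commandCounts.putAll(reportedCounts);
  }

  // Called when the node is declared dead. resetCounts selects between
  // the two options: clear the counts, or keep the last known values.
  synchronized void onDead(boolean resetCounts) {
    dead = true;
    if (resetCounts) {
      commandCounts.clear();
    }
  }

  // Unknown command types report a count of zero.
  synchronized int getCommandCount(String commandType) {
    return commandCounts.getOrDefault(commandType, 0);
  }

  synchronized boolean isDead() {
    return dead;
  }
}
```

Either way the tracker object itself stays around after the node dies, which
mirrors the DatanodeInfo not being removed.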
I am not sure the leftover command counts are a big issue, as we should avoid
scheduling commands on dead (and maybe stale) nodes anyway. E.g. before
scheduling a command for a node, we need to check that it is HEALTHY, as
otherwise the commands will be queued in SCM and never picked up by a DN. If
something in SCM keeps scheduling commands for dead nodes, the command queue
will slowly fill up SCM memory.
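The kind of guard I have in mind looks roughly like the sketch below. It is a
self-contained illustration only: HealthGatedCommandQueue, NodeHealth, setHealth
and offer are hypothetical names, and commands are plain strings rather than
real SCM command objects.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;

// Hypothetical sketch of a command queue that refuses work for non-healthy nodes.
final class HealthGatedCommandQueue {
  enum NodeHealth { HEALTHY, STALE, DEAD }

  private final Map<UUID, NodeHealth> health = new HashMap<>();
  private final Map<UUID, Queue<String>> queues = new HashMap<>();

  void setHealth(UUID datanode, NodeHealth state) {
    health.put(datanode, state);
  }

  // Queues the command only if the node is currently HEALTHY.
  // Unknown, stale and dead nodes are rejected, so nothing piles up
  // in memory for a DN that will never fetch it.
  boolean offer(UUID datanode, String command) {
    if (health.get(datanode) != NodeHealth.HEALTHY) {
      return false;
    }
    queues.computeIfAbsent(datanode, k -> new ArrayDeque<>()).add(command);
    return true;
  }

  int queuedCount(UUID datanode) {
    Queue<String> q = queues.get(datanode);
    return q == null ? 0 : q.size();
  }
}
```

Whether STALE nodes should be rejected as well is a judgment call; the sketch
rejects them to keep the check simple.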