sodonnel commented on code in PR #3781: URL: https://github.com/apache/ozone/pull/3781#discussion_r1005334878
########## hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java: ########## @@ -168,6 +232,55 @@ public int getTrackedNodeCount() { return trackedNodes.size(); } + synchronized void setMetricsToGauge() { + metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers); + metrics.setTrackedRecommissionNodesTotal(trackedRecommission); + metrics.setTrackedDecommissioningMaintenanceNodesTotal( + trackedDecomMaintenance); + metrics.setTrackedContainersUnderReplicatedTotal( + underReplicatedContainers); + metrics.setTrackedContainersSufficientlyReplicatedTotal( + sufficientlyReplicatedContainers); + metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose); + for (Map.Entry<String, ContainerStateInWorkflow> e : + containerStateByHost.entrySet()) { + metrics.metricRecordOfContainerStateByHost(e.getKey(), + e.getValue().sufficientlyReplicated, + e.getValue().underReplicatedContainers, + e.getValue().unhealthyContainers, + e.getValue().pipelinesWaitingToClose); + } + } + + void resetContainerMetrics() { + pipelinesWaitingToClose = 0; + sufficientlyReplicatedContainers = 0; + unhealthyContainers = 0; + underReplicatedContainers = 0; + + for (Map.Entry<String, ContainerStateInWorkflow> e : Review Comment: > On each scheduled run of the monitor, the implementation captures the current workflow state completely prior to flushing the metric update to the NodeDecommissionMetric object (DatanodeAdminMonitorImpl.setMetricsToGauge). In doing so, it tries to keep each metric refresh pull from getMetrics display a full snapshot of the last captured run of the monitor. The code in `DatanodeAdminMonitorImpl.setMetricsToGauge` is: ``` synchronized void setMetricsToGauge() { metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers); metrics.setTrackedRecommissionNodesTotal(trackedRecommission); metrics.setTrackedDecommissioningMaintenanceNodesTotal( trackedDecomMaintenance); metrics.setTrackedContainersUnderReplicatedTotal( underReplicatedContainers); metrics.setTrackedContainersSufficientlyReplicatedTotal( sufficientlyReplicatedContainers); metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose); for (Map.Entry<String, ContainerStateInWorkflow> e : ... ``` It makes multiple calls to `metrics` and there is nothing stopping getMetrics() being called on the metrics object by another thread half way through the execution of `setMetricsToGauge`. This means the metrics can still be snapshot inconsistently. To make it consistent, you need to synchronize in the metrics object and set ALL the metrics in a single synchronized call. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org For additional commands, e-mail: issues-h...@ozone.apache.org