sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1005334878


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   > On each scheduled run of the monitor, the implementation captures the 
current workflow state completely prior to flushing the metric update to the 
NodeDecommissionMetric object (DatanodeAdminMonitorImpl.setMetricsToGauge). In 
doing so, it tries to keep each metric refresh pull from getMetrics display a 
full snapshot of the last captured run of the monitor.
   
   The code in `DatanodeAdminMonitorImpl.setMetricsToGauge` is:
   
   ```
   synchronized void setMetricsToGauge() {
       metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
       metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
       metrics.setTrackedDecommissioningMaintenanceNodesTotal(
               trackedDecomMaintenance);
       metrics.setTrackedContainersUnderReplicatedTotal(
               underReplicatedContainers);
       metrics.setTrackedContainersSufficientlyReplicatedTotal(
               sufficientlyReplicatedContainers);
       metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
       for (Map.Entry<String, ContainerStateInWorkflow> e :
   ...
   ```
   
   It makes multiple calls to `metrics` and there is nothing stopping 
getMetrics() being called on the metrics object by another thread half way 
through the execution of `setMetricsToGauge`. This means the metrics can still 
be snapshot inconsistently.
   
   To make it consistent, you need to synchronize in the metrics object and set 
ALL the metrics in a single synchronized call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org

Reply via email to