sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1294032163

   Why use threadLocal variables in the latest patch? Does this even work? If 
another thread reads the values set by the monitor thread, does it even get the 
values, as they should be local to the thread who set them?
   
   Lets take a step back. The Decommission Monitor thread is synchronised 
already, and at the end of a run, the accumulated values for the metrics can be 
sent to the gauge, while still under the lock.
   
   In that way, a consistent set of metrics can be sent to the gauge. However 
they are being sent to the gauge over a number of calls, so the gauge could be 
snapshot halfway through, giving inconsistent results.
   
   Instead, you can make the snapshot method in the metrics class synchronized. 
Then in the sendMetricsToGauge method, you can wrap all the calls to the 
metrics class in a synchronized block too:
   
   ```
     private synchronized void setMetricsToGauge() {
       synchronized(metrics) {
         metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers.get());
         metrics.setTrackedRecommissionNodesTotal(trackedRecommission.get());
         metrics.setTrackedDecommissioningMaintenanceNodesTotal(
                 trackedDecomMaintenance.get());
         metrics.setTrackedContainersUnderReplicatedTotal(
                 underReplicatedContainers.get());
         metrics.setTrackedContainersSufficientlyReplicatedTotal(
                 sufficientlyReplicatedContainers.get());
         metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose
             .get());
         metrics.metricRecordOfContainerStateByHost(containerStateByHost.get());
       }
     }
   ```
   That way, we get a consistent set of metrics into the metric class, and 
block the `getMetrics(...)` method from executing when the metrics are in the 
process of being updated, preventing any race conditions.
   
   I think you should revert that latest commit, and the synchronised part as I 
suggested here, and then we can see if we can clean up the reset side of things 
(which I think you already did here).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org

Reply via email to