rahulrane50 commented on code in PR #2344:
URL: https://github.com/apache/helix/pull/2344#discussion_r1091224695


##########
helix-core/src/main/java/org/apache/helix/monitoring/mbeans/ClusterStatusMonitor.java:
##########
@@ -55,6 +55,61 @@
 import org.slf4j.LoggerFactory;
 
 public class ClusterStatusMonitor implements ClusterStatusMonitorMBean {
+  private class AsyncMissingTopStateMonitor extends Thread {
+    private final ConcurrentHashMap<String, Map<String, Long>> _missingTopStateResourceMap;
+    private long _missingTopStateDurationThreshold = Long.MAX_VALUE;
+
+    public AsyncMissingTopStateMonitor(ConcurrentHashMap<String, Map<String, Long>> missingTopStateResourceMap) {
+      _missingTopStateResourceMap = missingTopStateResourceMap;
+    }
+
+    public void setMissingTopStateDurationThreshold(long missingTopStateDurationThreshold) {
+      _missingTopStateDurationThreshold = missingTopStateDurationThreshold;
+    }
+
+    @Override
+    public void run() {
+      try {
+        synchronized (this) {
+          while (true) {
+            while (_missingTopStateResourceMap.size() == 0) {
+              this.wait();
+            }
+            for (Iterator<Map.Entry<String, Map<String, Long>>> resourcePartitionIt =
+                _missingTopStateResourceMap.entrySet().iterator(); resourcePartitionIt.hasNext(); ) {
+              Map.Entry<String, Map<String, Long>> resourcePartitionEntry = resourcePartitionIt.next();
+              // Iterate over all partitions; if any partition's top state has been missing
+              // longer than the threshold, report it.
+              ResourceMonitor resourceMonitor = getOrCreateResourceMonitor(resourcePartitionEntry.getKey());
+              // If all partitions of the resource have recovered their top state, reset the counter.
+              if (resourcePartitionEntry.getValue().isEmpty()) {
+                resourceMonitor.resetMissingTopStateDurationGuage();
+                resourcePartitionIt.remove();
+              } else {
+                for (Long missingTopStateStartTime : resourcePartitionEntry.getValue().values()) {
+                  if (_missingTopStateDurationThreshold < Long.MAX_VALUE
+                      && System.currentTimeMillis() - missingTopStateStartTime > _missingTopStateDurationThreshold) {
+                    resourceMonitor.updateMissingTopStateDurationGuage(
+                        System.currentTimeMillis() - missingTopStateStartTime);
+                  }
+                }
+              }
+            }
+            sleep(50); // Instead of feeding a stream of duration values to the histogram, the thread sleeps in between to save some CPU cycles.

Review Comment:
   Had an offline discussion with @junkaixue. After giving it some thought, I came up with the solution below; please let me know if it looks okay. Summary:
   1. The original sleep here was meant to save a few computational cycles in this tight for loop, but it is difficult to justify any particular sleep duration.
   2. In the new solution, the thread reports a metric for a partition only if it has not already been reported within the last sliding-window reset interval. This has a few benefits. Sleeping would stop the thread from reporting metrics for all resources and all partitions, which may be wrong, because new partitions could start missing their top state during that sleep. Ideally, once the thread has reported a duration for a partition, it can skip **that** partition until its sliding window has elapsed.
   @desaikomal hence I didn't add a sleep here for the sliding-window reset time, but instead used that value to decide whether the duration should be reported for a given partition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

