kfaraz commented on code in PR #18598:
URL: https://github.com/apache/druid/pull/18598#discussion_r2409716522


##########
extensions-contrib/prometheus-emitter/src/main/java/org/apache/druid/emitter/prometheus/PrometheusEmitter.java:
##########
@@ -256,4 +279,21 @@ public void setPushGateway(PushGateway pushGateway)
   {
     this.pushGateway = pushGateway;
   }
+
+  private void cleanUpStaleMetrics()
+  {
+    if (config.getFlushPeriod() == null) {
+      return;
+    }
+
+    Map<String, DimensionsAndCollector> map = metrics.getRegisteredMetrics();
+    for (Map.Entry<String, DimensionsAndCollector> entry : map.entrySet()) {
+      if (entry.getValue().isExpired(config.getFlushPeriod())) {
+        log.debug("Metric [%s] has expired (last updated %d ms ago)",
+                 entry.getKey(),
+                 entry.getValue().getTimeSinceLastUpdate());
+        entry.getValue().getCollector().clear();

Review Comment:
   Hmm, I suppose removing the children for specific labels would be a more 
complete solution.
   But it could blow up the memory usage pretty fast since we would now be 
tracking the last updated time for every unique set of label values for every 
metric.
   
   I think the issue this PR is trying to address is that if a certain metric 
stops being reported by a service
   altogether (irrespective of label values), it should not remain in the 
collector anymore. e.g. in leadership changes.
   
   Also, the staleness concern should really only apply to gauge metrics..
   So if metrics like `task/run/time` and `query/wait/time` are still being 
updated by the current
   service (for any label), we should continue emitting them. Getting rid of 
the expired labels
   becomes more of a mem usage concern than anything else.
   
   But with gauge metrics like say `segment/size` (which represents bytes of 
segment available on a historical
   for a given datasource), I agree that the story is different.
   If all segments of a datasource are removed from a historical, we should 
definitely stop reporting `segment/size` for that datasource.
   
   But I would suggest we add a comment and postpone the full solution for 
gauge metrics for a later PR.
   Ideally, we would want to write up an embedded test (using Prometheus 
testcontainer) and verify the behaviour
   of gauge metrics in such cases and then proceed with a solution.
   
   What are your thoughts, @abhishekrb19 , @aho135 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to