kfaraz commented on code in PR #18598:
URL: https://github.com/apache/druid/pull/18598#discussion_r2409716522
##########
extensions-contrib/prometheus-emitter/src/main/java/org/apache/druid/emitter/prometheus/PrometheusEmitter.java:
##########
@@ -256,4 +279,21 @@ public void setPushGateway(PushGateway pushGateway)
{
this.pushGateway = pushGateway;
}
+
+ private void cleanUpStaleMetrics()
+ {
+ if (config.getFlushPeriod() == null) {
+ return;
+ }
+
+ Map<String, DimensionsAndCollector> map = metrics.getRegisteredMetrics();
+ for (Map.Entry<String, DimensionsAndCollector> entry : map.entrySet()) {
+ if (entry.getValue().isExpired(config.getFlushPeriod())) {
+ log.debug("Metric [%s] has expired (last updated %d ms ago)",
+ entry.getKey(),
+ entry.getValue().getTimeSinceLastUpdate());
+ entry.getValue().getCollector().clear();
Review Comment:
Hmm, I suppose removing the children for specific labels would be a more
complete solution.
But it could blow up the memory usage pretty fast since we would now be
tracking the last updated time for every unique set of label values for every
metric.
I think the issue this PR is trying to address is that if a certain metric
stops being reported by a service
altogether (irrespective of label values), it should not remain in the
collector anymore. e.g. in leadership changes.
Also, the staleness concern should really only apply to gauge metrics..
So if metrics like `task/run/time` and `query/wait/time` are still being
updated by the current
service (for any label), we should continue emitting them. Getting rid of
the expired labels
becomes more of a mem usage concern than anything else.
But with gauge metrics like say `segment/size` (which represents bytes of
segment available on a historical
for a given datasource), I agree that the story is different.
If all segments of a datasource are removed from a historical, we should
definitely stop reporting `segment/size` for that datasource.
But I would suggest we add a comment and postpone the full solution for
gauge metrics for a later PR.
Ideally, we would want to write up an embedded test (using Prometheus
testcontainer) and verify the behaviour
of gauge metrics in such cases and then proceed with a solution.
What are your thoughts, @abhishekrb19 , @aho135 ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]