kfaraz commented on code in PR #18598:
URL: https://github.com/apache/druid/pull/18598#discussion_r2409716522
##########
extensions-contrib/prometheus-emitter/src/main/java/org/apache/druid/emitter/prometheus/PrometheusEmitter.java:
##########
@@ -256,4 +279,21 @@ public void setPushGateway(PushGateway pushGateway)
{
this.pushGateway = pushGateway;
}
+
+ private void cleanUpStaleMetrics()
+ {
+ if (config.getFlushPeriod() == null) {
+ return;
+ }
+
+ Map<String, DimensionsAndCollector> map = metrics.getRegisteredMetrics();
+ for (Map.Entry<String, DimensionsAndCollector> entry : map.entrySet()) {
+ if (entry.getValue().isExpired(config.getFlushPeriod())) {
+ log.debug("Metric [%s] has expired (last updated %d ms ago)",
+ entry.getKey(),
+ entry.getValue().getTimeSinceLastUpdate());
+ entry.getValue().getCollector().clear();
Review Comment:
Hmm, I suppose removing the children for specific labels would be a more
complete solution.
But it could blow up the memory usage pretty fast since we would now be
tracking the last updated time for every unique set of label values for every
metric.
I think the issue this PR is trying to address is that if a certain metric
stops being reported by a service
altogether (irrespective of label values), it should not remain in the
collector anymore. e.g. in leadership changes.
Also, the staleness concern should really only apply to gauge metrics. So if
timer (histogram) metrics like `task/run/time` and `query/wait/time` are still
being updated by the current service (for any label), we should continue
emitting them. Getting rid of the expired labels becomes more of a mem usage
concern than anything else.
But with gauge metrics like say `segment/size` (which represents bytes of
segment available on a historical for a given datasource), I agree that the
story is different.
If all segments of a datasource are removed from a historical, we should
definitely stop reporting `segment/size` for that datasource.
But I would suggest we add a comment and postpone the full solution for
gauge metrics for a later PR.
Ideally, we would want to write up an embedded test (using Prometheus
testcontainer) and verify the behaviour of gauge metrics in such cases and then
proceed with a solution.
What are your thoughts, @abhishekrb19 , @aho135 ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]