[prometheus-users] Re: alertmanager - Resolved message issue

sangjae lee Thu, 15 Oct 2020 18:17:24 -0700

Thanks for your reply.

I tested yesterday using Grafana Explore, with the following result:
-  If a running Docker container is stopped or killed, its 'container_last_seen' 
data remains valid for only 5 minutes.
> I guess that after 5 minutes the data for the stopped or killed container 
is dropped, which is why the resolved message is always sent after 5 minutes.
> I looked into how to extend these 5 minutes and tried many config changes, 
testing and retrying, but it seems to be impossible.
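
One way to watch this happen is to graph the age of each series. This is only a minimal sketch, and it assumes that container_last_seen holds a Unix timestamp in seconds, as cAdvisor normally exports it:

```promql
# Seconds since each Docker container was last seen by the exporter.
# For a stopped container this value should keep growing, and the series
# should then disappear once its last sample falls outside the 5-minute
# lookback window.
time() - container_last_seen{id=~"/docker/.*"}
```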


So I tested this PromQL expression:

    count(rate(container_last_seen{id=~"/docker/.*"}[1m])) < 10

This expression works fine, but it cannot show the instance name or the 
Docker ID. It only returns the count, so I can't tell exactly which Docker 
instance is down.
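
A possible alternative to the count(...) expression above, which keeps the identifying labels, is sketched here. It is untested against this setup and rests on several assumptions: that the series carry a "name" label with the Docker container name (cAdvisor usually adds one), that container_last_seen is a Unix timestamp in seconds, and that the 1h window and the 60-second threshold, both picked arbitrarily for illustration, suit the environment:

```promql
# max_over_time(...[1h]) still returns a series for any container that
# reported at least once in the last hour, so the result does not vanish
# after the 5-minute lookback. Its value is the newest "last seen" time.
# Grouping by (instance, name) rather than id lets a restarted container
# with the same name but a new id clear the alert.
time()
  - max by (instance, name) (
      max_over_time(container_last_seen{id=~"/docker/.*"}[1h])
    )
> 60
```

With this form the alert should fire roughly a minute after a container stops reporting, and it should resolve either when a container with the same name reports again or when the 1h window expires.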

I really want to solve this issue.
When a Docker instance goes down, the alert should fire and be caught 
immediately, and the resolved message should arrive only once the Docker 
instance has restarted.


On Tuesday, 13 October 2020 at 4:28:27 PM UTC+9, [email protected] wrote:

> Thank you.  You originally said:
>
> > and firing problems are not cleared, but resolve message is always send 
> > after 5 minutes.
>
> It sounds to me like this is a staleness issue.  That is: the 
> container_last_seen{...} metric which triggered the alert is no longer 
> present in scrapes.  The PromQL rule evaluation only looks back 5 minutes 
> in time to find a data point.  Anything older than that is not found.
>
> When you have a PromQL expression like this:
>
>     expr: foo > 5
>
> it's really a chained filter:
> (1) "foo" filters down to just metrics with __name__="foo"
> (2) "> 5" further filters down to just metrics where the current value is > 5
>
> The alert then fires if the filter returns one or more timeseries; and if 
> a particular timeseries triggered an alert, but subsequently vanishes, then 
> it is considered to be resolved.
>
> If a particular timeseries hasn't been seen in a scrape for more than 5 
> minutes, then it won't be returned in step (1).
>
> That's my best guess at what's going on.  To prove or disprove this, go 
> into the PromQL browser in the web interface and enter
>
> container_last_seen{id=~"/docker/.*"}[10m]
>
> This will show you the raw datapoints (values and timestamps) over the 
> last 10 minutes for that metric.  If a given timeseries stopped being 
> scraped, then you'll see no more data points added.  So the last value 
> scraped will be able to trigger an alert, but only for 5 minutes, until it 
> becomes stale.
>
