The fundamental problem is that Prometheus has no way of knowing which
containers should be there. With your regex, an infinite number of containers
are "absent": 0dev-4, 1dev-4, … 9999dev-4, … fjdhrhfksnhdev-4, etc.
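
To make this concrete, here is how absent() behaves for the two selectors
from your rules, as far as I understand it: it returns a single series only
when nothing at all matches the selector, and it can only carry over labels
from equality matchers. The group and alert names below are made up for
illustration.

```yaml
groups:
  - name: absent_examples   # hypothetical group name
    rules:
      # Equality matcher: when the series is missing, the result is
      # {name="be-dev-4"} 1, so {{ $labels.name }} can be filled in.
      - alert: ExactNameAbsent
        expr: absent(container_start_time_seconds{name="be-dev-4"})

      # Regex matcher: returns a result only when NO series matches
      # .*dev-4 at all, and that result has no name label, so
      # {{ $labels.name }} is empty.
      - alert: RegexNameAbsent
        expr: absent(container_start_time_seconds{name=~".*dev-4"})
```

So as long as one container with that suffix is still running, the regex
version stays silent, and when it does fire there is no name to report.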

To solve this, you need a list of the containers you concretely expect, kept
somewhere. That could be a separate alert per container if the number is
small, or some metric that is present even when the container is stopped. In
that case you can use the unless operator:

all_expected_containers unless on(name) container_start_time_seconds
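
As a full alerting rule this might look like the sketch below.
all_expected_containers is a placeholder for whatever metric lists your
expected containers; it has to carry a name label with the same values as the
one on container_start_time_seconds. The group name is made up, the rest is
taken from your rule.

```yaml
groups:
  - name: container_alerts   # hypothetical group name
    rules:
      - alert: ContainerKilled
        # Keep every expected container that has no matching
        # container_start_time_seconds series with the same name label.
        expr: all_expected_containers unless on(name) container_start_time_seconds
        for: 15m
        labels:
          severity: 'warning'
        annotations:
          summary: 'Container killed'
          # The name label comes from all_expected_containers, so it is
          # still available when the container itself is gone.
          description: 'Container {{ $labels.name }} has disappeared'
```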

If no such metric exists yet, you could generate it using recording rules
(this again requires listing the containers, but is less verbose than
separate alerts), write a small exporter that gets the data from your source
of truth, or use

container_start_time_seconds offset 15m

in place of all_expected_containers, to look for containers that were running
before and now are not. The downside is that this is noisy when a container
is expected to go away, and these alerts "resolve" after 15m whether the
container is back up or not.
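
A rough sketch of the recording-rule and offset variants, under some
assumptions: fe-dev-4 is an invented second container name, the group and
alert names are made up, and the name label is assumed to be the only label
you need to match on.

```yaml
groups:
  - name: expected_containers   # hypothetical group name
    rules:
      # One recording rule per expected container. vector(1) has no
      # labels; the rule attaches the name label to the recorded series.
      - record: all_expected_containers
        expr: vector(1)
        labels:
          name: be-dev-4
      - record: all_expected_containers
        expr: vector(1)
        labels:
          name: fe-dev-4   # invented example name

      # Variant without a list: containers that existed 15m ago but have
      # no series now. The expression only holds for about 15m after the
      # container disappears, so "for" must stay well below the offset.
      - alert: ContainerDisappeared
        expr: container_start_time_seconds offset 15m unless on(name) container_start_time_seconds
        for: 1m
        labels:
          severity: 'warning'
        annotations:
          description: 'Container {{ $labels.name }} has disappeared'
```

The recorded all_expected_containers series can then be used on the
left-hand side of the unless expression above.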

/MR


On Mon, Mar 8, 2021, 15:39 Tamar <[email protected]> wrote:

> Hi,
>
> I am trying to create an alert for stopped containers.
>
> If I am using the exact container name I have no problem:
>
> - alert: ContainerKilled
>   expr: absent(container_start_time_seconds{name="be-dev-4"})
>   for: 15m
>   labels:
>     severity: 'warning'
>   annotations:
>     summary: 'Container killed'
>     description: 'A container {{ $labels.name }} has disappeared'
>
> However, if I try to use a regexp for the container name (as I have a
> few containers with this suffix), then it fails whatever I try.
> If I use this, then no alert is sent:
> - alert: ContainerKilled
>   expr: absent(container_start_time_seconds{name=~".*dev-4"})
>   for: 15m
>   labels:
>     severity: 'warning'
>   annotations:
>     summary: 'Container killed'
>     description: 'A container {{ $labels.name }} has disappeared'
>
> If I use this, then an alert is sent, but without the stopped container name:
> - alert: ContainerKilled2
>   expr: absent(container_start_time_seconds{name=~".*dev-4"})
>   for: 15m
>   labels:
>     severity: 'warning'
>   annotations:
>     summary: 'Container killed'
>     description: 'A container has disappeared {{ $labels.instance }} of job {{ $labels.job }}'
>
> Any idea how to alert with a regexp *and* the container name?
>
> Thanks
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/24fe1ee7-3275-4747-93b9-9f0f51821533n%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/24fe1ee7-3275-4747-93b9-9f0f51821533n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

