Hi folks,

I've written a blackbox_exporter-like process which exposes a 
probe_duration_seconds gauge. It is scraped every 15s. Given a latency 
threshold, I'd like to create recording rules that count the number of bad 
events and the total number of events over the last 1m, to derive a latency 
SLI. Obtaining the total number of events can be done with:

count_over_time(probe_duration_seconds[1m])

However, I'm not sure how to filter a range vector and count the remaining 
samples. My first thought was a subquery (assuming an arbitrary 2s 
threshold for a bad event):

sum_over_time((probe_duration_seconds > bool 2)[1m:15s])

During a complete outage, I've found this returns 5 samples, whereas, due to 
alignment, the total-events query will almost always return 4 — so the 
derived good/total ratio goes negative. Using clamp_min() to paper over this 
feels like a hack, as does using a subquery in a recording rule.
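For concreteness, the ratio I'm deriving is essentially the following (this is 
the subquery variant that exhibits the off-by-one; the 2s threshold is 
arbitrary):

```
# good events / total events over the last 1m
(
  count_over_time(probe_duration_seconds[1m])
  - sum_over_time((probe_duration_seconds > bool 2)[1m:15s])
)
/
count_over_time(probe_duration_seconds[1m])
```

With 5 bad samples against 4 total samples, the numerator is -1, hence the 
negative SLI.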

Is there a better way than evaluating probe_duration_seconds > bool 2 in a 
separate rule group with a 15s interval, then applying sum_over_time() over 
the past 1m to the resulting series in a second group evaluated every 1m? A 
completely different option would be to have the exporter expose booleans and 
sum those over time, but it would be great to keep the thresholds in 
Prometheus config.
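For reference, the two-rule-group variant I'm considering would look roughly 
like this (the group and recorded-series names are placeholders I made up):

```yaml
groups:
  # Evaluate the threshold at the scrape cadence, so every scraped
  # sample becomes a 0/1 sample in a new series.
  - name: probe_threshold
    interval: 15s
    rules:
      - record: probe:latency_bad:bool
        expr: probe_duration_seconds > bool 2

  # Aggregate over the SLI window in a second group.
  - name: probe_sli
    interval: 1m
    rules:
      - record: probe:latency_bad:count1m
        expr: sum_over_time(probe:latency_bad:bool[1m])
      - record: probe:events:count1m
        expr: count_over_time(probe_duration_seconds[1m])
```

It works, but it spreads one logical SLI across two groups, which is why I'm 
hoping there's something cleaner.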

Many thanks,
George
