Hi folks, I've written a blackbox_exporter-like process which exposes a probe_duration_seconds gauge. It is scraped every 15s. Given a latency threshold, I'd like to create recording rules to count the number of bad events and number of total events over the last 1m to derive a latency SLI. Obtaining the total number of events can be done with:
count_over_time(probe_duration_seconds[1m])

However, I'm not sure how to filter a range vector and count the remaining samples. My first thought was a subquery (assuming an arbitrary 2s threshold for a bad event):

sum_over_time((probe_duration_seconds > bool 2)[1m:15s])

During a complete outage, I've found this returns 5, whereas due to alignment, the total-events query almost always returns 4, resulting in a negative ratio for the SLI. Using clamp_min() to fix this feels like a hack, as does using a subquery in a recording rule.

Is there a better way than evaluating probe_duration_seconds > bool 2 in a separate rule group with a 15s interval, then applying sum_over_time() to the resulting series every 1m over the past 1m?

A completely different option would be to have the exporter expose booleans and sum those over time, but it would be great to keep the thresholds within the Prometheus config.

Many thanks,
George
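P.S. For concreteness, the separate-rule-group variant I'm asking about would look something like the sketch below. The rule and metric names here are placeholders of my own invention, not anything standard:

```yaml
groups:
  # Evaluated every scrape interval, so the boolean series has one
  # sample per probe sample.
  - name: probe_sli_15s
    interval: 15s
    rules:
      # 1 if the probe breached the 2s latency threshold, 0 otherwise.
      - record: probe:latency_bad:bool
        expr: probe_duration_seconds > bool 2

  # Evaluated once a minute to derive the SLI inputs.
  - name: probe_sli_1m
    interval: 1m
    rules:
      # Bad events over the last minute.
      - record: probe:latency_bad:sum1m
        expr: sum_over_time(probe:latency_bad:bool[1m])
      # Total events over the last minute.
      - record: probe:latency_total:count1m
        expr: count_over_time(probe_duration_seconds[1m])
```

Both 1m rules then count samples from the same underlying timeline, which is what I'd hope avoids the off-by-one between the subquery and count_over_time().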

