Finally came up with a solution for this:
var interval = 5s
var threshold_warn = 65s
var threshold_crit = 305s
// Note: the alert will come in between (threshold + interval) and
// (threshold + interval * 2).
// This should be improved by
// https://github.com/influxdata/kapacitor/issues/898
var data = stream
|from().measurement('system').groupBy('host')
|stats(interval).align()
|derivative('emitted').unit(interval).as('emitted') // can't use difference() because of https://github.com/influxdata/kapacitor/issues/904
var data_warn_window = data
|window().period(threshold_warn).every(1u)
var data_warn_size = data_warn_window // will contain the number of points in the window
|count('emitted').as('value')
var data_warn = data_warn_window
|sum('emitted').as('value')
|join(data_warn_size).as('emitted','size')
var data_crit_window = data
|window().period(threshold_crit).every(1u)
var data_crit_size = data_crit_window // will contain the number of points in the window
|count('emitted').as('value')
var data_crit = data_crit_window
|sum('emitted').as('value')
|join(data_crit_size).as('emitted','size')
data_warn
|join(data_crit).as('warn','crit')
|alert()
.crit(lambda:
("crit.size.value" * interval >= threshold_crit) // make sure we have a full window to prevent false alerts at start
AND ("crit.emitted.value" == 0)
)
.warn(lambda:
("warn.size.value" * interval >= threshold_warn) // make sure we have a full window to prevent false alerts at start
AND ("warn.emitted.value" == 0)
)
Unfortunately with this design it's not possible to put in the alert message
how long the host has been unresponsive for, only that it's greater than the
threshold. But at least it works.
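
For anyone who wants to check the alert condition outside Kapacitor, here is a rough Python sketch of the same logic. It is only a simulation, not TICKscript: the names alert_level and silent_for are made up for illustration, and it assumes a full window holds ceil(threshold / interval) points, which may differ from how Kapacitor treats window boundaries.

```python
import math


def alert_level(emitted, interval=5, threshold_warn=65, threshold_crit=305):
    """Return "CRITICAL", "WARNING" or "OK" for a list of per-interval
    emitted counts (oldest first), mimicking the script's two conditions."""

    def silent_for(threshold):
        # Assumption: a full window contains ceil(threshold / interval) points.
        size = math.ceil(threshold / interval)
        window = emitted[-size:]
        # Same guard as the script: only alert once the window is full,
        # which prevents false alerts right after startup.
        full = len(window) * interval >= threshold
        return full and sum(window) == 0

    if silent_for(threshold_crit):
        return "CRITICAL"
    if silent_for(threshold_warn):
        return "WARNING"
    return "OK"
```

With the default values, 12 silent intervals (60s) are not enough for a warning because the window is not yet full, while 13 silent intervals (65s) are. As in the script, the result only says that the threshold was exceeded, not for how long.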
-Patrick
On Monday, September 12, 2016 at 5:30:36 PM UTC-4, [email protected] wrote:
> I'm trying to detect when nodes stop reporting data to telegraf. However I
> don't want to use something like `deadman` as I want to have multiple alert
> levels, such as warn and critical. Warn would be if no data has been received
> for >60s, and critical would be >300s (or similar).
>
> The only way I can think of to do this is taking a `stats()|derivative()`
> node, copying it with a `where(lambda: "emitted" > 0)`, and then getting the
> time difference between the last data point with the filter and the last data
> point without the filter. But I can't figure out how to accomplish this.
>
> Any help would be appreciated.
>
> Thanks
>
> -Patrick