This is great work! We'll take a look at those issues to see if we can't help simplify this workflow. Thanks again.
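In the meantime, here's a rough sketch of the kind of simplification we had in mind. It's untested, and the alert ids and messages are only placeholders, but since the deadman() helper only raises critical alerts, writing out the equivalent stats/derivative pipeline once per threshold gives each level its own check without the window/join stages:

var data = stream
    |from()
        .measurement('system')
        .groupBy('host')

// Warn: no points emitted for the host group in the last 60s.
data
    |stats(60s)
        .align()
    |derivative('emitted')
        .unit(60s)
        .nonNegative()
    |alert()
        .id('deadman-warn/{{ .Group }}')
        .message('{{ .ID }}: no data received for at least 60s')
        .warn(lambda: "emitted" <= 0.0)

// Crit: no points emitted for the host group in the last 300s.
data
    |stats(300s)
        .align()
    |derivative('emitted')
        .unit(300s)
        .nonNegative()
    |alert()
        .id('deadman-crit/{{ .Group }}')
        .message('{{ .ID }}: no data received for at least 300s')
        .crit(lambda: "emitted" <= 0.0)

It shares the limitation you ran into, though: the alert can only report that the gap exceeded the threshold, not how long the host has actually been silent, and the 300s check is only evaluated once per stats interval.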
On Wednesday, September 14, 2016 at 12:07:22 PM UTC-6, [email protected] wrote:
>
> Finally came up with a solution for this:
>
> var interval = 5s
> var threshold_warn = 65s
> var threshold_crit = 305s
> // Note: the alert will come in between (threshold + interval) and (threshold + interval * 2).
> // This should be improved by https://github.com/influxdata/kapacitor/issues/898
>
> var data = stream
>     |from().measurement('system').groupBy('host')
>     |stats(interval).align()
>     |derivative('emitted').unit(interval).as('emitted') // can't use difference() because of https://github.com/influxdata/kapacitor/issues/904
>
> var data_warn_window = data
>     |window().period(threshold_warn).every(1u)
> var data_warn_size = data_warn_window // will contain the number of points in the window
>     |count('emitted').as('value')
> var data_warn = data_warn_window
>     |sum('emitted').as('value')
>     |join(data_warn_size).as('emitted','size')
>
> var data_crit_window = data
>     |window().period(threshold_crit).every(1u)
> var data_crit_size = data_crit_window // will contain the number of points in the window
>     |count('emitted').as('value')
> var data_crit = data_crit_window
>     |sum('emitted').as('value')
>     |join(data_crit_size).as('emitted','size')
>
> data_warn
>     |join(data_crit).as('warn','crit')
>     |alert()
>         .crit(lambda:
>             ("crit.size.value" * interval >= threshold_crit) // make sure we have a full window to prevent false alerts at start
>             AND ("crit.emitted.value" == 0)
>         )
>         .warn(lambda:
>             ("warn.size.value" * interval >= threshold_warn) // make sure we have a full window to prevent false alerts at start
>             AND ("warn.emitted.value" == 0)
>         )
>
> Unfortunately with this design it's not possible to put in the alert message how long the host has been unresponsive for, only that it's greater than the threshold. But at least it works.
>
> -Patrick
>
> On Monday, September 12, 2016 at 5:30:36 PM UTC-4, [email protected] wrote:
> > I'm trying to detect when nodes stop reporting data to telegraf. However I don't want to use something like `deadman` as I want to have multiple alert levels, such as warn and critical. Warn would be if no data has been received for >60s, and critical would be >300s (or similar).
> >
> > The only way I can think of to do this is taking a `stats()|derivative()` node, copying it with a `where(lambda: "emitted" > 0)`, and then getting the time difference between the last data point with the filter and the last data point without the filter. But I can't figure out how to accomplish this.
> >
> > Any help would be appreciated.
> >
> > Thanks
> >
> > -Patrick
