Thanks for the full-blown deadman code, it made me realize what I was doing wrong... I was using multiple scripts to inject data into my InfluxDB database, and I was running the wrong one, so the fact that a critical alert was raised and never resolved was correct. The metric name was close enough for me not to notice it. I really feel stupid now xD
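In case anyone else hits the same thing: an easy way to catch a near-miss measurement name is to temporarily log the raw stream with no measurement filter at all, so the name of every point Kapacitor receives shows up in its log. A rough sketch (untested):

stream
    // No .measurement() or .where() filter on purpose: log every
    // incoming point so a mismatched measurement name stands out.
    |from()
    |log()
        .prefix('ALL INCOMING POINTS')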
Thanks for your help. Is the script above really equivalent to the deadman() call? Because if that's the case, I think I will keep this one, since I can actually understand what it is doing.

On Monday, 7 November 2016 18:54:03 UTC+1, [email protected] wrote:
>
>> The test case is so basic I really don't see what I could be doing
>> wrong...
>
> Agreed, what version of Kapacitor are you using? I just manually tested
> the deadman with the latest release and it's working fine.
>
> Could you try this TICKscript to help us get to the bottom of what is
> going on?
>
> var data = stream
>     |from()
>         .measurement('invite_delay')
>         .where(lambda: "host" == 'router' AND "app_name" == 'phone_tester')
>
> data
>     // |deadman(1.0, 10s) is equivalent to the code below, with the
>     // exception of the |log statements
>     |stats(10s)
>         .align()
>     |log()
>         .prefix('DEADMAN RAW STATS')
>     |derivative('emitted')
>         .unit(10s)
>         .nonNegative()
>     |log()
>         .prefix('DEADMAN STATS')
>     |alert()
>         .id('{{ .TaskName }}/{{ .Name }}')
>         .crit(lambda: "emitted" <= 1.0)
>         .stateChangesOnly()
>         .log('/tmp/dead.log')
>
> data
>     |log()
>         .prefix('RAW DATA')
>
> With the added log statements we should be able to determine where the
> breakdown is. After running this script, can you share the relevant logs?
>
> Thanks
>
> On Monday, November 7, 2016 at 10:18:33 AM UTC-7, Julien Ammous wrote:
>>
>> I just did another test with 10s instead of 3min to make it easier, with
>> the same result. Here is what I do:
>>
>> - I insert a point and wait 10s: the alert is correctly raised
>> - I insert four points and wait 10s: nothing happens
>>
>> The Kapacitor alert endpoint confirms what I see:
>>
>>     "alert5": {
>>         "alerts_triggered": 1,
>>         "avg_exec_time_ns": 30372,
>>         "collected": 29,
>>         "crits_triggered": 1,
>>         "emitted": 1,
>>         "infos_triggered": 0,
>>         "oks_triggered": 0,
>>         "warns_triggered": 0
>>     },
>>
>> 1 critical alert was raised and no OK.
>>
>> The test case is so basic I really don't see what I could be doing
>> wrong...
>>
>> On 7 November 2016 at 17:28, <[email protected]> wrote:
>>
>>> To answer your questions:
>>>
>>> Yes, the deadman should fire an OK alert, and it should do so within the
>>> deadman interval of the point arriving. In your case, since you are
>>> checking on 3m intervals, if a new point arrives it should fire an OK
>>> alert within 3m of that point's arrival.
>>>
>>> As for the sources, they are a bit hidden, since the deadman function is
>>> really just syntactic sugar for a combination of nodes. Primarily,
>>> deadman uses the stats node under the hood. See
>>> https://github.com/influxdata/kapacitor/blob/master/stats.go
>>>
>>> As for what might be going on in your case, I have one idea. The deadman
>>> comparison is less than or equal to the threshold, so since you have a
>>> threshold of 1, you have to send at least 2 points in 3m for the OK to
>>> be sent. Can you verify that at least 2 points arrived within 3m and you
>>> still didn't get an OK alert?
>>>
>>> On Monday, November 7, 2016 at 2:28:44 AM UTC-7, Julien Ammous wrote:
>>>>
>>>> Hi,
>>>> I want to have an alert raised when no data has been received in the
>>>> last 3min, but I also want the alert to be resolved as soon as new data
>>>> arrives again. I have been playing with deadman, but I can't figure out
>>>> how to make it send an OK state when data arrives again. Here is the
>>>> script:
>>>>
>>>> stream
>>>>     |from()
>>>>         .measurement('invite_delay')
>>>>         .where(lambda: "host" == 'router' AND "app_name" == 'phone_tester')
>>>>     |deadman(1.0, 3m)
>>>>         .id('{{ .TaskName }}/{{ .Name }}')
>>>>         .stateChangesOnly()
>>>>         .levelField('level')
>>>>         .idField('id')
>>>>         .durationField('duration')
>>>>     |influxDBOut()
>>>>         .database('metrics')
>>>>         .measurement('alerts')
>>>>         .retentionPolicy('raw')
>>>>
>>>> I get a CRITICAL alert when data has been missing for 3min, so this
>>>> part works, but if data starts flowing again I get nothing. I kept it
>>>> running while doing something else and never got any OK for this
>>>> alert :(
>>>>
>>>> I tried to find the source for the deadman logic, but I couldn't find
>>>> it. I have a few questions:
>>>> - when data is received again, is the deadman alert supposed to send an
>>>> OK state?
>>>> - if it is, when will it send it? Will it be as soon as a point
>>>> arrives, or will there be a delay? (let's pretend influxDBOut writes
>>>> the alert immediately for this question)
>>>>
>>>> Where is the deadman logic defined in the sources? I am not too
>>>> familiar with Go, but I searched for "Deadman" and what came up looked
>>>> like just structures and their accessors, not that useful.
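For anyone landing on this thread later: given the explanation above that the deadman's crit expression is "emitted" <= threshold, a threshold of 1.0 only goes back to OK once more than one point arrives per window. If a single new point should be enough to resolve the alert, a threshold of 0.0 should do it. A sketch of the original task with that one change (untested, otherwise identical to the script quoted above):

stream
    |from()
        .measurement('invite_delay')
        .where(lambda: "host" == 'router' AND "app_name" == 'phone_tester')
    // Threshold 0.0: CRITICAL fires only when zero points arrive in 3m,
    // and a single new point flips the alert back to OK.
    |deadman(0.0, 3m)
        .id('{{ .TaskName }}/{{ .Name }}')
        .stateChangesOnly()
        .levelField('level')
        .idField('id')
        .durationField('duration')
    |influxDBOut()
        .database('metrics')
        .measurement('alerts')
        .retentionPolicy('raw')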
