Thanks for the full-blown deadman code, it made me realize what I was 
doing wrong...
I was using multiple scripts to inject data into my InfluxDB database, and 
I was using the wrong one, so the fact that a critical alert was raised 
and never resolved was correct. The metric name was close enough that I 
didn't notice it. I really feel stupid now xD

Thanks for your help. Is the script above really equivalent to the 
deadman() call? Because if that's the case, I think I will keep this one, 
since I can actually understand what it is doing.
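To check my understanding, here is a rough sketch of what I think the 
expanded script does (plain Python, not Kapacitor code, and the names are 
mine): stats(10s) counts points per aligned 10s window, the derivative 
with unit 10s turns that count into a points-per-10s rate, and the alert 
goes CRIT when that rate is less than or equal to the threshold.

```python
THRESHOLD = 1.0   # same threshold as |deadman(1.0, 10s)
INTERVAL = 10     # window size in seconds

def deadman_levels(point_times, horizon):
    """Return (window_start, level) for each aligned 10s window."""
    levels = []
    for start in range(0, horizon, INTERVAL):
        # stats: how many points arrived in this window
        emitted = sum(start <= t < start + INTERVAL for t in point_times)
        # derivative with unit == window size: the rate is just the count here
        rate = emitted
        levels.append((start, 'CRIT' if rate <= THRESHOLD else 'OK'))
    return levels

print(deadman_levels([3], 10))           # one point  -> [(0, 'CRIT')]
print(deadman_levels([1, 2, 4, 7], 10))  # four points -> [(0, 'OK')]
```

If that is right, it also matches your earlier explanation: with a 
threshold of 1, a single point per interval still satisfies rate <= 1.0, 
so at least two points per interval are needed for the OK.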

On Monday, 7 November 2016 18:54:03 UTC+1, [email protected] wrote:
>
> > The test case is so basic I really don't see what I could be doing 
> wrong...
>
> Agreed, what version of Kapacitor are you using? I just manually tested 
> the deadman with the latest release and it's working fine. 
>
> Could you try this TICKscript to help us get to the bottom of what is 
> going on?
>
> var data = stream
>     |from()
>         .measurement('invite_delay')
>         .where(lambda: "host" == 'router' AND "app_name" == 'phone_tester')
>
> data
>     // |deadman(1.0, 10s) is equivalent to the below code, with the 
> exception of the |log statements
>     |stats(10s)
>         .align()
>     |log()
>         .prefix('DEADMAN RAW STATS')
>     |derivative('emitted')
>         .unit(10s)
>         .nonNegative()
>     |log()
>         .prefix('DEADMAN STATS')
>     |alert()
>         .id('{{ .TaskName }}/{{ .Name }}')
>         .crit(lambda: "emitted" <= 1.0)
>         .stateChangesOnly()
>         .log('/tmp/dead.log')
>
> data
>     |log()
>         .prefix('RAW DATA')
>
>
> With the added log statements we should be able to determine where the 
> breakdown is. After running this script can you share the relevant logs?
>
>  Thanks
>
>
> On Monday, November 7, 2016 at 10:18:33 AM UTC-7, Julien Ammous wrote:
>>
>> I just did another test with 10s instead of 3min to make it easier, with 
>> the same result. Here is what I do:
>>
>> - I insert a point and wait 10s: the alert is correctly raised
>> - I insert four points and wait 10s: nothing happens
>>
>> The kapacitor alert endpoint confirms what I see:
>>
>> "alert5": {
>>     "alerts_triggered": 1,
>>     "avg_exec_time_ns": 30372,
>>     "collected": 29,
>>     "crits_triggered": 1,
>>     "emitted": 1,
>>     "infos_triggered": 0,
>>     "oks_triggered": 0,
>>     "warns_triggered": 0
>> },
>>
>> One critical alert was raised and no OK.
>>
>> The test case is so basic I really don't see what I could be doing 
>> wrong...
>>
>> On 7 November 2016 at 17:28, <[email protected]> wrote:
>>
>>> To answer your questions:
>>>
>>> Yes, the deadman should fire an OK alert, and it should do so within the 
>>> deadman interval of the point arriving. In your case, since you are checking 
>>> on 3m intervals, if a new point arrives it should fire an OK alert within 
>>> 3m of that point's arrival.
>>>
>>> As for the sources they are a bit hidden since the deadman function is 
>>> really just syntactic sugar for a combination of nodes. Primarily deadman 
>>> uses the stats node under the hood. See 
>>> https://github.com/influxdata/kapacitor/blob/master/stats.go 
>>>
>>>
>>> As for what might be going on in your case, I have one idea. The deadman 
>>> comparison is less than or equal to the threshold, so with a threshold 
>>> of 1 you have to send at least 2 points in 3m for the OK to be sent. 
>>> Can you verify that at least 2 points arrived within 3m and you still 
>>> didn't get an OK alert?
>>>
>>>
>>> On Monday, November 7, 2016 at 2:28:44 AM UTC-7, Julien Ammous wrote:
>>>>
>>>> Hi,
>>>> I want an alert raised when no data have been received in the last 
>>>> 3min, but I also want the alert to be resolved as soon as new data 
>>>> arrive again. I have been playing with deadman, but I can't figure out 
>>>> how to make it save an OK state when data arrive again. Here is the script:
>>>>
>>>> stream
>>>>     |from()
>>>>         .measurement('invite_delay')
>>>>         .where(lambda: "host" == 'router' AND "app_name" == 'phone_tester')
>>>>     |deadman(1.0, 3m)
>>>>         .id('{{ .TaskName }}/{{ .Name }}')
>>>>         .stateChangesOnly()
>>>>         .levelField('level')
>>>>         .idField('id')
>>>>         .durationField('duration')
>>>>     |influxDBOut()
>>>>         .database('metrics')
>>>>         .measurement('alerts')
>>>>         .retentionPolicy('raw')
>>>>
>>>>
>>>> I get a CRITICAL alert when data have been missing for 3min; this 
>>>> works, but if data start flowing again I get nothing. I kept it running 
>>>> while doing something else and never got any OK for this alert :(
>>>>
>>>> I tried to find the source for the deadman logic but I couldn't find 
>>>> it. I have a few questions:
>>>> - when data are received again, is the deadman alert supposed to send 
>>>> an OK state?
>>>> - if it is, when will it send it? Will it be as soon as a point 
>>>> arrives, or will there be a delay? (let's pretend influxDBOut writes 
>>>> the alert immediately for this question)
>>>>
>>>> Where is the deadman logic defined in the sources? I am not too 
>>>> familiar with Go, but I searched for "Deadman" and what came up were just 
>>>> what looked like structures and their accessors, not that useful.
>>>>
>>>> -- 
>>> Remember to include the version number!
>>> --- 
>>> You received this message because you are subscribed to a topic in the 
>>> Google Groups "InfluxData" group.
>>> To unsubscribe from this topic, visit 
>>> https://groups.google.com/d/topic/influxdb/rUm82LQd9UI/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to 
>>> [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/influxdb.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/influxdb/83cc9a04-962e-4eba-9680-8a029c3e111c%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/influxdb/83cc9a04-962e-4eba-9680-8a029c3e111c%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
