Re: [influxdb] Re: deadman never triggers OK state

Julien Ammous Wed, 09 Nov 2016 02:37:38 -0800

Thanks, for some reason I thought the code shown in the documentation was 
just a way to explain what it was doing rather than the real implementation 
xD
It would be nice to be able to define our own "macros" to make writing our 
scripts quicker; that's where I wish you had used an existing embeddable 
language instead of writing your own.


On Tuesday, 8 November 2016 16:56:14 UTC+1, [email protected] wrote:
>
> Glad you figured it out.
>
> Yes, deadman is literally the same thing as that code; you can think 
> of deadman as a kind of macro. If you ever need to look up that code 
> again, it can be found in the docs here:
> https://docs.influxdata.com/kapacitor/v1.1/nodes/from_node/#deadman
>
>
>
> On Tuesday, November 8, 2016 at 5:10:54 AM UTC-7, Julien Ammous wrote:
>>
>> Thanks for the full-blown deadman code, it made me realize what I was 
>> doing wrong...
>> I use multiple scripts to inject data into my InfluxDB database and I was 
>> running the wrong one, so the critical alert being raised and never 
>> resolved was correct. The metric name was close enough that I did not 
>> notice. I really feel stupid now xD
>>
>> Thanks for your help. Is the script above really equivalent to the 
>> deadman() call? If that's the case I think I will keep this one, since I 
>> can actually understand what it is doing.
>>
>> On Monday, 7 November 2016 18:54:03 UTC+1, [email protected] wrote:
>>>
>>> > The test case is so basic I really don't see what I could be doing 
>>> wrong...
>>>
>>> Agreed. What version of Kapacitor are you using? I just manually tested 
>>> the deadman with the latest release and it's working fine. 
>>>
>>> Could you try this TICKscript to help us get to the bottom of what is 
>>> going on?
>>>
>>> var data = stream
>>>     |from()
>>>         .measurement('invite_delay')
>>>         .where(lambda: "host" == 'router' AND "app_name" == 'phone_tester')
>>>
>>> data
>>>     // |deadman(1.0, 10s) is equivalent to the code below,
>>>     // except for the |log statements
>>>     |stats(10s)
>>>         .align()
>>>     |log()
>>>         .prefix('DEADMAN RAW STATS')
>>>     |derivative('emitted')
>>>         .unit(10s)
>>>         .nonNegative()
>>>     |log()
>>>         .prefix('DEADMAN STATS')
>>>     |alert()
>>>         .id('{{ .TaskName }}/{{ .Name }}')
>>>         .crit(lambda: "emitted" <= 1.0)
>>>         .stateChangesOnly()
>>>         .log('/tmp/dead.log')
>>>
>>> data
>>>     |log()
>>>         .prefix('RAW DATA')
>>>
>>>
>>> With the added log statements we should be able to determine where the 
>>> breakdown is. After running this script can you share the relevant logs?
>>>
>>>  Thanks
>>>
>>>
>>> On Monday, November 7, 2016 at 10:18:33 AM UTC-7, Julien Ammous wrote:
>>>>
>>>> I just ran another test with 10s instead of 3min to make it quicker, 
>>>> with the same result. Here is what I do:
>>>>
>>>> - I insert a point and wait 10s, the alert is correctly raised
>>>> - I insert four points and wait 10s, nothing happens
>>>>
>>>> The kapacitor alert endpoint confirms what I see:
>>>>
>>>> "alert5": {
>>>>     "alerts_triggered": 1,
>>>>     "avg_exec_time_ns": 30372,
>>>>     "collected": 29,
>>>>     "crits_triggered": 1,
>>>>     "emitted": 1,
>>>>     "infos_triggered": 0,
>>>>     "oks_triggered": 0,
>>>>     "warns_triggered": 0
>>>> },
>>>>
>>>> One critical alert was raised and no OK.
>>>>
>>>> The test case is so basic I really don't see what I could be doing 
>>>> wrong...
>>>>
>>>> On 7 November 2016 at 17:28, <[email protected]> wrote:
>>>>
>>>>> To answer your questions:
>>>>>
>>>>> Yes, the deadman should fire an OK alert, and it should do so within 
>>>>> one deadman interval of the point arriving. In your case, since you are 
>>>>> checking on 3m intervals, if a new point arrives it should fire an OK 
>>>>> alert within 3m of that point's arrival.
>>>>>
>>>>> As for the sources they are a bit hidden since the deadman function is 
>>>>> really just syntactic sugar for a combination of nodes. Primarily deadman 
>>>>> uses the stats node under the hood. See 
>>>>> https://github.com/influxdata/kapacitor/blob/master/stats.go 
>>>>>
>>>>>
>>>>> As for what might be going on in your case, I have one idea. The 
>>>>> deadman comparison is less than or equal to the threshold, so with a 
>>>>> threshold of 1 you have to send at least 2 points within 3m for the OK 
>>>>> to be sent. Can you verify that at least 2 points arrived within 3m and 
>>>>> you still didn't get an OK alert?
>>>>>
>>>>>
>>>>> On Monday, November 7, 2016 at 2:28:44 AM UTC-7, Julien Ammous wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I want an alert raised when no data has been received in the last 
>>>>>> 3min, but I also want the alert cleared as soon as new data arrives 
>>>>>> again. I have been playing with deadman but I can't figure out how to 
>>>>>> make it save an OK state when data arrives again. Here is the script:
>>>>>>
>>>>>> stream
>>>>>> |from()
>>>>>>   .measurement('invite_delay')
>>>>>>   .where(lambda: "host" == 'router' AND "app_name" == 'phone_tester')
>>>>>> |deadman(1.0, 3m)
>>>>>>   .id('{{ .TaskName }}/{{ .Name }}')
>>>>>>   .stateChangesOnly()
>>>>>>   .levelField('level')
>>>>>>   .idField('id')
>>>>>>   .durationField('duration')
>>>>>> |influxDBOut()
>>>>>>   .database('metrics')
>>>>>>   .measurement('alerts')
>>>>>>   .retentionPolicy('raw')
>>>>>>
>>>>>>
>>>>>> I get a CRITICAL alert when data has been missing for 3min, so that 
>>>>>> part works, but when data starts flowing again I get nothing. I kept it 
>>>>>> running while doing something else and never got any OK for this alert :(
>>>>>>
>>>>>> I tried to find the source for the deadman logic but I couldn't find 
>>>>>> it. I have a few questions:
>>>>>> - when data is received again, is the deadman alert supposed to send 
>>>>>> an OK state?
>>>>>> - if it is, when will it send it: as soon as a point arrives, or will 
>>>>>> there be a delay? (let's pretend influxDBOut writes the alert 
>>>>>> immediately for this question)
>>>>>>
>>>>>> Where is the deadman logic defined in the sources? I am not too 
>>>>>> familiar with Go, but I searched for "Deadman" and what came up were just 
>>>>>> what looked like structures and their accessors, not that useful.
>>>>>>
>>>>
>>>>
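
The threshold rule discussed in this thread (CRITICAL when the number of 
points in an interval is less than or equal to the threshold, OK otherwise, 
with stateChangesOnly() suppressing repeats) can be modelled in a few lines. 
This is only a rough Python sketch, not Kapacitor code: the function name 
deadman_states, the list-of-counts input, and the assumption that the alert 
starts in the OK state are all made up for illustration.

```python
from typing import List, Optional


def deadman_states(counts: List[int], threshold: float) -> List[Optional[str]]:
    """Model a deadman check combined with stateChangesOnly().

    counts[i] is the number of points seen in interval i (hypothetical input).
    Returns, per interval, the alert level emitted ("CRITICAL" or "OK"),
    or None when the level did not change and nothing would be sent.
    """
    events: List[Optional[str]] = []
    previous = "OK"  # assumption: the alert starts out in the OK state
    for count in counts:
        # "Less than or equal to" comparison, as described in the thread.
        level = "CRITICAL" if count <= threshold else "OK"
        # Only emit an event when the level differs from the previous one.
        events.append(level if level != previous else None)
        previous = level
    return events


# Threshold 1.0: one point per interval is still CRITICAL, two points recover.
print(deadman_states([0, 1, 2], threshold=1.0))
# Threshold 0.0: a single point in an interval is enough to recover.
print(deadman_states([0, 1, 2], threshold=0.0))
```

Under this model, a threshold of 0.0 clears the alert at the end of the first 
interval that contains any point at all, which is the behaviour asked for at 
the start of the thread; with 1.0, at least two points per interval are needed.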

-- 
Remember to include the version number!
--- 
You received this message because you are subscribed to the Google Groups 
"InfluxData" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/influxdb/44e65268-1835-4a49-aeff-aa00d4835cd1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
