FYI, I dug in and found several optimizations that can be made to the union node. A PR is incoming.
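For context on why the union node buffers at all: it has to emit points from all of its parents in a single time-ordered stream, so it must hold points back until it knows nothing earlier can still arrive. A minimal Python sketch of that merge (an illustration only; the real node is written in Go and also tracks groups):

```python
import heapq

# Toy model of a union node: merge several time-ordered point streams
# into one time-ordered stream. Points are (time, value) tuples and
# each input stream is already sorted by time.
def union(*streams):
    # heapq.merge buffers at most one pending point per input stream
    # and emits the earliest one only once no earlier point can arrive.
    return list(heapq.merge(*streams, key=lambda point: point[0]))

last = [(1, 'last'), (4, 'last')]
last_active = [(2, 'last_active'), (3, 'last_active')]
print(union(last, last_active))
# [(1, 'last'), (2, 'last_active'), (3, 'last_active'), (4, 'last')]
```

The optimizations are about how little the node can buffer while still guaranteeing that ordering.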
On Friday, November 11, 2016 at 1:59:20 PM UTC-7, [email protected] wrote:
>
> Below is a script that gets closer but still suffers from the 2nd and 3rd
> issues you mentioned. It does fix the first issue.
>
> I'll file a GitHub issue to see if we can change elapsed to return 0 if
> it only receives a single point.
>
> The third issue, about buffering, is going to stay. The buffering is coming
> from the union node: it needs to make sure that it emits points ordered by
> time, so it buffers points until it knows no more points can arrive. (It did
> seem to be buffering too many points when I tested it, so I'll double-check
> whether it can be optimized.)
>
> You have me thinking; I'll be playing around with this some more to make
> sure I haven't missed something.
>
>     var data = batch
>         |query('select active from telegraf.autogen.postgresql_replication_slots')
>             .period(4h)
>             .every(10s)
>             .groupBy('host', 'slot_name')
>
>     var data_last = data
>         |last('active')
>             .as('active')
>             .usePointTimes()
>
>     var data_last_active = data
>         |where(lambda: "active" == TRUE)
>         |last('active')
>             .as('active')
>             .usePointTimes()
>
>     var data_union = data_last
>         |union(data_last_active)
>         |log()
>             .prefix('UNION STREAM')
>         |window()
>             .period(10s)
>             .every(10s)
>             .align()
>         |log()
>             .prefix('UNION BATCH')
>
>     var data_elapsed = data_union
>         |elapsed('active', 1s)
>             .as('elapsed')
>         |last('elapsed')
>             .as('elapsed')
>         |log()
>             .prefix('ELAPSED')
>
>     var data_count = data_union
>         |count('active')
>             .as('count')
>         |log()
>             .prefix('COUNT')
>
>     data_elapsed
>         |join(data_count)
>             .as('elapsed', 'count')
>             .fill('none')
>         |log()
>             .prefix('JOIN')
>
> On Friday, November 11, 2016 at 10:34:06 AM UTC-7, [email protected] wrote:
>>
>> I'm trying to create an alert when a specific field has been in a certain
>> state for too long.
>> Currently my TICKscript looks like this:
>>
>>     var data = batch
>>         |query('select active from telegraf.autogen.postgresql_replication_slots')
>>             .period(4h)
>>             .every(10s)
>>             .groupBy('host', 'slot_name')
>>
>>     var data_last = data|last('active').as('active')
>>     var data_last_active = data|where(lambda: "active" == TRUE)|last('active').as('active')
>>
>>     var data_union = data_last|union(data_last_active)
>>     var data_elapsed = data_union|elapsed('active', 1s)|log()
>>     var data_count = data_union|count('active').as('count')|log()
>>
>>     data_elapsed|join(data_count).as('elapsed', 'count').tolerance(10s).fill('none')|log()
>>
>> The idea is that `data_last` will be the last data point and
>> `data_last_active` will be the last data point where `active == true`. We
>> would then calculate the time difference between these two points; if the
>> difference is greater than X, generate an alert.
>> But we also want to handle the case where `data_last_active` is empty (no
>> match within the time period), so we also get the count of points, which in
>> that case would be 1.
>>
>> However, there are numerous problems with this:
>> 1. `elapsed()` is including the data points from the previous batch
>>    period, instead of just those within the batch. So if the batch period
>>    is 60s, then one of the elapsed values is going to be 60s.
>> 2. `elapsed()` won't emit anything at all if there is no previous data
>>    point, thus breaking the case where `data_last_active` is empty.
>> 3. `count()` is buffering, and doesn't release the data points until the
>>    next batch comes in.
>>
>> Here's an example of what the above generates:
>>
>> [test:log10] 2016/11/11 12:31:09 I!
>> {"Name":"postgresql_replication_slots","Database":"","RetentionPolicy":"","Group":"host=fll2gdbs01qa,slot_name=fll2gbar01stg","Dimensions":{"ByName":false,"TagNames":["host","slot_name"]},"Tags":{"host":"fll2gdbs01qa","slot_name":"fll2gbar01stg"},"Fields":{"count":2},"Time":"2016-11-11T12:30:59.785563762-05:00"}
>>
>> [test:log10] 2016/11/11 12:31:09 I!
>> {"Name":"postgresql_replication_slots","Database":"","RetentionPolicy":"","Group":"host=fll2gdbs01qa,slot_name=fll2gdbs01qa","Dimensions":{"ByName":false,"TagNames":["host","slot_name"]},"Tags":{"host":"fll2gdbs01qa","slot_name":"fll2gdbs01qa"},"Fields":{"count":1},"Time":"2016-11-11T12:30:59.785563762-05:00"}
>>
>> [test:log8] 2016/11/11 12:31:09 I!
>> {"Name":"postgresql_replication_slots","Database":"","RetentionPolicy":"","Group":"host=fll2gdbs01qa,slot_name=fll2gbar01stg","Dimensions":{"ByName":false,"TagNames":["host","slot_name"]},"Tags":{"host":"fll2gdbs01qa","slot_name":"fll2gbar01stg"},"Fields":{"elapsed":10},"Time":"2016-11-11T17:31:09.785572403Z"}
>>
>> [test:log8] 2016/11/11 12:31:09 I!
>> {"Name":"postgresql_replication_slots","Database":"","RetentionPolicy":"","Group":"host=fll2gdbs01qa,slot_name=fll2gbar01stg","Dimensions":{"ByName":false,"TagNames":["host","slot_name"]},"Tags":{"host":"fll2gdbs01qa","slot_name":"fll2gbar01stg"},"Fields":{"elapsed":0},"Time":"2016-11-11T17:31:09.785572403Z"}
>>
>> [test:log8] 2016/11/11 12:31:09 I!
>> {"Name":"postgresql_replication_slots","Database":"","RetentionPolicy":"","Group":"host=fll2gdbs01qa,slot_name=fll2gdbs01qa","Dimensions":{"ByName":false,"TagNames":["host","slot_name"]},"Tags":{"host":"fll2gdbs01qa","slot_name":"fll2gdbs01qa"},"Fields":{"elapsed":10},"Time":"2016-11-11T17:31:09.785572403Z"}
>>
>> Any suggestions to get this working?
>>
>> -Patrick

--
Remember to include the version number!
---
You received this message because you are subscribed to the Google Groups "InfluxData" group.
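As a footnote on the elapsed() problems from the original post, its per-batch behavior can be modeled in a few lines of Python (a sketch of the semantics only, not Kapacitor's implementation):

```python
# Toy model of elapsed(): emit the difference between each point's time
# and the previous point's time.
def elapsed(times):
    return [t - prev for prev, t in zip(times, times[1:])]

# Diffs within a batch look right, but if the node carries the last
# point of the previous batch forward, the first diff spans batches
# (issue 1):
print(elapsed([0, 10, 70]))  # [10, 60]

# A lone point has no previous point, so nothing is emitted at all
# (issue 2) -- hence the proposal to return 0 for a single point.
print(elapsed([42]))         # []
```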
