[influxdb] Re: [Kapacitor] Questions about my tick script (joins and other things)

nathaniel Mon, 13 Feb 2017 08:39:03 -0800


On Friday, February 10, 2017 at 6:11:00 PM UTC-7, Glenn Davy wrote:
>
> On Saturday, 11 February 2017 07:05:14 UTC+13, [email protected] wrote:
>
>> Thanks for a detailed question! 
>>
>
> Welcome! Thanks for a detailed answer :)
>
>
>> Not quite, the join node has two parent nodes log4 and groupBy8. Neither 
>> parent has sent any points on to the join node, so the join node has not 
>> had an opportunity to do anything yet. If you follow the trail back up, the 
>> window6 node has not emitted any values either. Meaning that not enough 
>> data has arrived for it to trigger emitting a window. The other window node 
>> did get enough data to trigger one emit but that was it. 
>>
>>
> I don't really understand this, in as much as, whats enough data to 
> trigger an emit?
>


If the period of a window is, say, 1m, then one minute's worth of data has to 
arrive before the window can emit. That could be as little as two 
points more than 1m apart, in which case the window would contain just the 
first point. 
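
As a rough sketch of that behaviour (the measurement name is taken from your 
script; the 1m values for period and every are only picked for illustration), 
a window like this passes nothing downstream until points spanning a full 
minute have arrived:

```
stream
    |from()
        .measurement('process_count')
        .groupBy('host')
    |window()
        // Each emitted batch covers 1m of data.
        .period(1m)
        // A batch is emitted once per 1m of data (illustrative value).
        .every(1m)
    // Nothing is logged here until the window emits its first batch.
    |log()
```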

>  
>
>> Looks like you are windowing the data so that you can have the grace 
>> period you were talking about for new hosts. In that case you will want to 
>> configure the alert node with `.all()` so that all points in the window 
>> have to meet the conditions in order to trigger an alert.
>> If you are not using the window for that purpose then just remove it, as 
>> it's not doing anything otherwise.
>>
> Nope, that wasn't the purpose; it was really just to give me the illusion 
> of understanding what was happening :D
>
> So, then, what is the purpose of the window? Is it just a way of saying 
> "confine your processing to what's in this group"? So that, for example, if 
> I'd done a first()/last()/sum()/count()/max()/min()/other() it would have 
> only applied to what was in the window? Or does it have some other use?
>
Yes, a window defines how you want to batch up your data to perform 
aggregations, transformations and selections on the data. 
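
For example, here is a minimal sketch that takes the minimum process count 
over each window. The `min_count` name and the 5m/1m values are placeholders 
I picked for illustration, not anything from your setup:

```
stream
    |from()
        .measurement('process_count')
        .groupBy('name', 'host')
    |window()
        .period(5m)
        .every(1m)
    // The aggregate only sees the points in the current window.
    |min('count')
        .as('min_count')
    |log()
```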

>  
>
>>
>> 2) What have I done wrong for this join to be failing? Am I completely 
>> misunderstanding the join (or even more general), or is there just a small 
>> implementation issue?
>>
>> Understanding the join `.on` property is the key here. The `.on` 
>> property expects one of the parents to be grouped by a more specific set 
>> of tags and the other parent to be grouped by a less specific set of 
>> tags.
>> For example, in your case the process data should be grouped by name and 
>> host, while the uptime data is grouped only by host. The resulting data is 
>> grouped by the more specific set of tags (i.e. name and host). I'll show 
>> an example below.
>>
> OK, great, thanks! That makes sense, and seems to work now!
>
>> Other than that, your eval looks correct.
>>
> The eval gives me this error in the logs:
>
> eval9] 2017/02/11 00:23:45 E! no field or tag exists for process.time
>
> When I look at the data sent to victor I see this snippet listing the 
> columns
>
>
> ["time","process.count","sys.load1","sys.load15","sys.load5","sys.n_cpus","sys.n_users","sys.uptime","sys.uptime_format"]
>
> If I change from "process.time" to "time" (which seems to be the correct 
> thing to do) I get:
>
>  E! invalid math operator - for type time
>
> I'm guessing this is because I see these values associated with the above 
> columns when I peek into the victor message:
>
> [["2017-02-11T00:51:00Z",0,0.05,0.05,0.09,2,1,201," 0:03"]]}]
>
> I'm guessing the time maths is choking on that? What's the handling so 
> that times are processable inside Kapacitor, but get sent out in a readable 
> format?
>

What processing are you trying to do on the time value? Some time math is 
currently implemented but not all. See 
https://github.com/influxdata/kapacitor/issues/169 

>
> All that aside, I've remarked out the eval for now as it seemed 
> to stop data flowing through, but apart from this eval, everything is 
> now working as hoped!(tm). 
>

Glad it's working! 

>
> Thanks for all your explanations Nathan.
>
>
>
>>
>> var process_counts = stream
>>     |from()
>>         .measurement('process_count')
>>         // I am assuming that you want tag name and fully_qualified_role
>>         // as well, since you referenced them below in the alert.
>>         .groupBy('name', 'fully_qualified_role', 'host')
>>     |log()
>>
>> var box = stream
>>     |from()
>>         .measurement('system')
>>         // Only group by host here, since that is all the tag info we
>>         // have.
>>         .groupBy('host')
>>     |log()
>>
>> var process_with_uptime = process_counts
>>     |join(box)
>>         .as('process', 'sys')
>>         .tolerance(15s)
>>         .on('host')
>>     |log()
>>         .prefix('** JOIN')
>>     |eval(lambda: "process.time" - "sys.uptime")
>>         .as('boot_time')
>>
>> process_with_uptime
>>     |log()
>>         .level('DEBUG')
>>         .prefix('** PROCSESS WITH UPTIME')
>>     |alert()
>>         .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
>>         .message('{{ index .Tags "name" }} has {{ index .Fields "process.count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{ index .Fields "boot_time" }}.')
>>         .info(lambda: "process.count" >= 0)
>>         .warn(lambda: "process.count" == 0)
>>         .crit(lambda: ("process.count" == 0) AND ("sys.uptime" > 120)) // possible alternative
>>         .victorOps()
>>
>>
>> On Friday, February 10, 2017 at 3:29:56 AM UTC-7, Glenn Davy wrote:
>>>
>>> Hi Peeps
>>>
>>> I'm trying to learn to use Kapacitor and hitting a few snags in my 
>>> understanding, trying to solve this simple problem has surfaced all sorts 
>>> of questions, and I'm hoping to get some of my misunderstandings sorted out.
>>>
>>> I've got a measurement called process_count that shows a count of the 
>>> number of a given process running by host, and there's a 'system' table 
>>> which comes from Telegraf and is essentially the output of `uptime`.
>>>
>>> If that process stops running (process_count goes to 0), I want to be 
>>> alerted. But when a new box comes up, I want to allow some breathing space 
>>> before we get alerts.
>>>
>>> There are obviously a few ways to solve this (I've even tried some!) and 
>>> I'm keen to learn better ways, but I'm running with this as a sample for 
>>> asking questions.
>>>
>>> Samples are sent to influx at about 30 second intervals (+/- jitter).
>>>
>>> I'm trying to join process_count records onto the system record (1 
>>> system record for many process_count records), so that there's an uptime 
>>> field available when I determine my critical alert.
>>>
>>>
>>> Here's a sample from my process_count table and from my system
>>>
>>> ```
>>> > select count, host, name from process_count where instance_id='i-0xxxxx3e078a04f20' group by instance_id order by time desc limit 10;
>>> name: process_count
>>> tags: instance_id=i-0xxxxx3e078a04f20
>>> time                           count host                                       name
>>> ----                           ----- ----                                       ----
>>> 2017-02-10T03:25:56.004751872Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:25:55.984448256Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>> 2017-02-10T03:25:25.92088576Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:25:25.900282368Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>> 2017-02-10T03:24:55.834618368Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:24:55.814406144Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>> 2017-02-10T03:24:25.751718144Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:24:25.7313984Z   6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>> 2017-02-10T03:23:55.66639104Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:23:55.64570112Z  6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>>
>>> > select uptime, host from system where host='someapp-production-web-i-0xxxxx3e078a04f20' order by time desc limit 5;
>>> name: system
>>> time                 uptime host
>>> ----                 ------ ----
>>> 2017-02-10T03:26:04Z 55399  someapp-production-web-i-0xxxxx3e078a04f20
>>> 2017-02-10T03:25:30Z 55365  someapp-production-web-i-0xxxxx3e078a04f20
>>> 2017-02-10T03:25:01Z 55336  someapp-production-web-i-0xxxxx3e078a04f20
>>> 2017-02-10T03:24:35Z 55309  someapp-production-web-i-0xxxxx3e078a04f20
>>> 2017-02-10T03:24:01Z 55276  someapp-production-web-i-0xxxxx3e078a04f20
>>> ```
>>>
>>> And here's the TICKscript. The problem I seem to be having is that nothing 
>>> is coming out of the join. I'm not getting any logging out of the .log on 
>>> the join or the subsequent stream. I'm hoping that the process_counts zip 
>>> to the nearest time (I have a tolerance of 14s) based on host. Also, the 
>>> annotations in the DOT script seem to suggest nothing is processed through 
>>> these streams.
>>>
>>>
>>> ```
>>> ID: someapp_production_process_not_running
>>> Error: 
>>> Template: 
>>> Type: stream
>>> Status: enabled
>>> Executing: true
>>> Created: 09 Feb 17 12:20 UTC
>>> Modified: 10 Feb 17 03:35 UTC
>>> LastEnabled: 10 Feb 17 03:35 UTC
>>> Databases Retention Policies: ["someapp_production"."default"]
>>> TICKscript:
>>>
>>>
>>> var process_counts = stream
>>>     |from()
>>>         .measurement('process_count')
>>>     |window()
>>>         .period(10m)
>>>         .every(30s)
>>>     |groupBy('host')
>>>     |log()
>>>
>>> var box = stream
>>>     |from()
>>>         .measurement('system')
>>>     |window()
>>>         .period(10m)
>>>         .every(30s)
>>>     |log()
>>>     |groupBy('host')
>>>
>>> var process_with_uptime = process_counts
>>>     |join(box)
>>>         .as('process', 'sys')
>>>         .tolerance(14s)
>>>         .streamName('process_with_uptime')
>>>         .on('host')
>>>     |log()
>>>         .prefix('** JOIN')
>>>     |eval(lambda: "process.time" - "sys.uptime")
>>>         .as('boot_time')
>>>
>>> process_with_uptime
>>>     |log()
>>>         .level('DEBUG')
>>>         .prefix('** PROCSESS WITH UPTIME')
>>>     |alert()
>>>         .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
>>>         .message('{{ index .Tags "name" }} has {{index .Fields "count" }}  processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{index .Fields "boot_time"}}.')
>>>         .info(lambda: "count" >= 0)
>>>         .warn(lambda: "count" == 0)
>>>         .crit(lambda: ("count" == 0) AND ("sys.uptime" < 120)) // possible alternative
>>>         .victorOps()
>>>
>>> DOT:
>>> digraph someapp_production_process_not_running {
>>> graph [throughput="0.00 points/s"];
>>>
>>> stream0 [avg_exec_time_ns="0" ];
>>> stream0 -> from5 [processed="23"];
>>> stream0 -> from1 [processed="23"];
>>>
>>> from5 [avg_exec_time_ns="157ns" ];
>>> from5 -> window6 [processed="6"];
>>>
>>> window6 [avg_exec_time_ns="565ns" ];
>>> window6 -> log7 [processed="0"];
>>>
>>> log7 [avg_exec_time_ns="0" ];
>>> log7 -> groupby8 [processed="0"];
>>>
>>> groupby8 [avg_exec_time_ns="0" ];
>>> groupby8 -> join10 [processed="0"];
>>>
>>> from1 [avg_exec_time_ns="452ns" ];
>>> from1 -> window2 [processed="17"];
>>>
>>> window2 [avg_exec_time_ns="1.05µs" ];
>>> window2 -> groupby3 [processed="1"];
>>>
>>> groupby3 [avg_exec_time_ns="0" ];
>>> groupby3 -> log4 [processed="0"];
>>>
>>> log4 [avg_exec_time_ns="0" ];
>>> log4 -> join10 [processed="0"];
>>>
>>> join10 [avg_exec_time_ns="0" ];
>>> join10 -> log11 [processed="0"];
>>>
>>> log11 [avg_exec_time_ns="0" ];
>>> log11 -> eval12 [processed="0"];
>>>
>>> eval12 [avg_exec_time_ns="0" eval_errors="0" ];
>>> eval12 -> log13 [processed="0"];
>>>
>>> log13 [avg_exec_time_ns="0" ];
>>> log13 -> alert14 [processed="0"];
>>>
>>> alert14 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" 
>>> infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
>>> }
>>> ```
>>>
>>> My questions are:
>>> 1) Am I right that this is failing at the join? Or are there 
>>> fundamentally bigger problems?
>>> 2) What have I done wrong for this join to be failing? Am I completely 
>>> misunderstanding the join (or something even more general), or is there 
>>> just a small implementation issue?
>>> 3) In order to use the result of the join, am I wrong to name it with a 
>>> var for reuse below? I thought .streamName('..') might do this without 
>>> setting a var, but I simply get an error that 'process_with_uptime' isn't 
>>> something that's in scope.
>>> 4) Is my overall approach just fundamentally wrong? 
>>> 5) Apart from using joins, what's the correct way to take the result of 1 
>>> stream (or batch) and use it in another?
>>> 6) Should I have approached this some totally different way?
>>>
>>>
