[influxdb] Re: [Kapacitor] Questions about my tick script (joins and other things)

nathaniel Fri, 10 Feb 2017 10:05:30 -0800

Thanks for a detailed question! 

1) Am I right that this is failing at the join? Or are there 
fundamentally bigger problems?


Not quite. The join node has two parent nodes, log4 and groupby8. Neither 
parent has sent any points on to the join node, so the join node has not 
had an opportunity to do anything yet. If you follow the trail back up, the 
window6 node has not emitted any values either, meaning that not enough 
data has arrived for it to trigger emitting a window. The other window node 
(window2) did get enough data to trigger one emit, but that was it. 

It looks like you are windowing the data so that you can have the grace 
period you were talking about for new hosts. In that case you will want to 
configure the alert node with `.all()` so that all points in the window 
have to meet the conditions in order to trigger an alert.
If you are not using the window for that purpose, then just remove it, as 
it's not doing anything otherwise.
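
To make the `.all()` suggestion concrete, here is a rough sketch of the windowed variant. The 2m period and the grouping tags are assumptions picked for illustration, not values taken from your setup, and the uptime join is left out to keep it short.

```
stream
    |from()
        .measurement('process_count')
        .groupBy('name', 'host')
    // Assumed grace period: collect the last 2 minutes of points per group.
    |window()
        .period(2m)
        .every(30s)
    |alert()
        .crit(lambda: "count" == 0)
        // Only trigger when every point in the window matches the condition.
        .all()
```

With this shape a single zero reading is not enough to alert; the count has to stay at zero for the whole window.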


2) What have I done wrong for this join to be failing? Am I completely 
misunderstanding the join (or something even more general), or is there just 
a small implementation issue?

Understanding the join's `.on` property is the key here. The way the `.on` 
property works is that it expects one of the parents to be grouped by a set 
of more specific tags and the other parent to be grouped by less specific 
tags.
For example, in your case the process data should be grouped by name and 
host, while the uptime data is grouped only by host. The resulting data is 
grouped by the more specific set of tags (i.e. name and host). I'll show 
an example below.

3) In order to use the result of the join, am I wrong to name it with a var 
for reuse below? I thought .streamName('..') might do this without setting 
a var, but I simply get an error that 'process_with_uptime' isn't something 
that's in scope.

`.streamName` is rarely needed. It would become the name of the measurement 
if you were to write the data back to InfluxDB or something similar. Other 
than that, your eval looks correct.

4) Is my overall approach just fundamentally wrong?

 Nope

5) Apart from using joins, what's the correct way to take the result of one 
stream (or batch) and use it in another?

Join is the answer here.

6) Should I have approached this some totally different way?

Like you said, there are a few different ways to go about this. I think the 
simplest is a slight modification of what you are doing. Instead of 
windowing the data, simply join the raw uptime and process streams. Then 
use an expression like the one you have for the critical alert to check the 
uptime, which will naturally filter out new hosts.

var process_counts = stream
    |from()
        .measurement('process_count')
        // I am assuming that you want the name and fully_qualified_role tags
        // as well, since you referenced them below in the alert.
        .groupBy('name', 'fully_qualified_role', 'host')
    |log()

var box = stream
    |from()
        .measurement('system')
        // Only group by host here, since that is all the tag info we have.
        .groupBy('host')
    |log()

var process_with_uptime = process_counts
    |join(box)
        .as('process', 'sys')
        .tolerance(15s)
        .on('host')
    |log()
        .prefix('** JOIN')
    |eval(lambda: "process.time" - "sys.uptime")
        .as('boot_time')
        // Keep the joined fields; without .keep() eval passes on only boot_time.
        .keep()

process_with_uptime
    |log()
        .level('DEBUG')
        .prefix('** PROCSESS WITH UPTIME')
    |alert()
        .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
        // After the join, fields carry the .as() prefix, hence "process.count".
        .message('{{ index .Tags "name" }} has {{ index .Fields "process.count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{ index .Fields "boot_time" }}.')
        .info(lambda: "process.count" >= 0)
        .warn(lambda: "process.count" == 0)
        // possible alternative
        .crit(lambda: ("process.count" == 0) AND ("sys.uptime" > 120))
        .victorOps()


On Friday, February 10, 2017 at 3:29:56 AM UTC-7, Glenn Davy wrote:
>
> Hi Peeps
>
> I'm trying to learn to use Kapacitor and hitting a few snags in my 
> understanding. Trying to solve this simple problem has surfaced all sorts 
> of questions, and I'm hoping to get some of my misunderstandings sorted out.
>
> I've got a measurement called process_count that shows a count of the 
> number of a given process running by host, and there's a 'system' table 
> which comes from Telegraf and is essentially the output of `uptime`.
>
> If that process stops running (process_count goes to 0), I want to be 
> alerted. But when a new box comes up, I want to allow some breathing space 
> before we get alerts.
>
> There are obviously a few ways to solve this (I've even tried some!) and 
> I'm keen to learn better ways, but I'm running with this as a sample for 
> asking questions.
>
> Samples are sent to influx at about 30 second intervals (+/- jitter).
>
> I'm trying to join process_count records onto the system record (1 system 
> record for many process_count records), so that there's an uptime field 
> available when I determine my critical alert.
>
>
> Here's a sample from my process_count table and from my system
>
> ```
> > select count, host, name from process_count where instance_id='i-0xxxxx3e078a04f20' group by instance_id order by time desc limit 10;
> name: process_count
> tags: instance_id=i-0xxxxx3e078a04f20
> time                           count host                                       name
> ----                           ----- ----                                       ----
> 2017-02-10T03:25:56.004751872Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
> 2017-02-10T03:25:55.984448256Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
> 2017-02-10T03:25:25.92088576Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
> 2017-02-10T03:25:25.900282368Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
> 2017-02-10T03:24:55.834618368Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
> 2017-02-10T03:24:55.814406144Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
> 2017-02-10T03:24:25.751718144Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
> 2017-02-10T03:24:25.7313984Z   6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
> 2017-02-10T03:23:55.66639104Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
> 2017-02-10T03:23:55.64570112Z  6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>
> > select uptime, host from system where host='someapp-production-web-i-0xxxxx3e078a04f20' order by time desc limit 5;
> name: system
> time                 uptime host
> ----                 ------ ----
> 2017-02-10T03:26:04Z 55399  someapp-production-web-i-0xxxxx3e078a04f20
> 2017-02-10T03:25:30Z 55365  someapp-production-web-i-0xxxxx3e078a04f20
> 2017-02-10T03:25:01Z 55336  someapp-production-web-i-0xxxxx3e078a04f20
> 2017-02-10T03:24:35Z 55309  someapp-production-web-i-0xxxxx3e078a04f20
> 2017-02-10T03:24:01Z 55276  someapp-production-web-i-0xxxxx3e078a04f20
> ```
>
> And here's the TICKscript. The problem I seem to be having is that nothing 
> is coming out of the join. I'm not getting any logging out of the .log on 
> the join or the subsequent stream. I'm hoping that the process_counts zip
> to the nearest time (I have a tolerance of 14s) based on host. Also, the 
> annotations in the DOT output seem to suggest nothing is processed through 
> these streams.
>
>
> ```
> ID: someapp_production_process_not_running
> Error: 
> Template: 
> Type: stream
> Status: enabled
> Executing: true
> Created: 09 Feb 17 12:20 UTC
> Modified: 10 Feb 17 03:35 UTC
> LastEnabled: 10 Feb 17 03:35 UTC
> Databases Retention Policies: ["someapp_production"."default"]
> TICKscript:
>
>
> var process_counts = stream
>     |from()
>         .measurement('process_count')
>     |window()
>         .period(10m)
>         .every(30s)
>     |groupBy('host')
>     |log()
>
> var box = stream
>     |from()
>         .measurement('system')
>     |window()
>         .period(10m)
>         .every(30s)
>     |log()
>     |groupBy('host')
>
> var process_with_uptime = process_counts
>     |join(box)
>         .as('process', 'sys')
>         .tolerance(14s)
>         .streamName('process_with_uptime')
>         .on('host')
>     |log()
>         .prefix('** JOIN')
>     |eval(lambda: "process.time" - "sys.uptime")
>         .as('boot_time')
>
> process_with_uptime
>     |log()
>         .level('DEBUG')
>         .prefix('** PROCSESS WITH UPTIME')
>     |alert()
>         .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
>         .message('{{ index .Tags "name" }} has {{index .Fields "count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{index .Fields "boot_time"}}.')
>         .info(lambda: "count" >= 0)
>         .warn(lambda: "count" == 0)
>         // possible alternative
>         .crit(lambda: ("count" == 0) AND ("sys.uptime" < 120))
>         .victorOps()
>
> DOT:
> digraph someapp_production_process_not_running {
> graph [throughput="0.00 points/s"];
>
> stream0 [avg_exec_time_ns="0" ];
> stream0 -> from5 [processed="23"];
> stream0 -> from1 [processed="23"];
>
> from5 [avg_exec_time_ns="157ns" ];
> from5 -> window6 [processed="6"];
>
> window6 [avg_exec_time_ns="565ns" ];
> window6 -> log7 [processed="0"];
>
> log7 [avg_exec_time_ns="0" ];
> log7 -> groupby8 [processed="0"];
>
> groupby8 [avg_exec_time_ns="0" ];
> groupby8 -> join10 [processed="0"];
>
> from1 [avg_exec_time_ns="452ns" ];
> from1 -> window2 [processed="17"];
>
> window2 [avg_exec_time_ns="1.05µs" ];
> window2 -> groupby3 [processed="1"];
>
> groupby3 [avg_exec_time_ns="0" ];
> groupby3 -> log4 [processed="0"];
>
> log4 [avg_exec_time_ns="0" ];
> log4 -> join10 [processed="0"];
>
> join10 [avg_exec_time_ns="0" ];
> join10 -> log11 [processed="0"];
>
> log11 [avg_exec_time_ns="0" ];
> log11 -> eval12 [processed="0"];
>
> eval12 [avg_exec_time_ns="0" eval_errors="0" ];
> eval12 -> log13 [processed="0"];
>
> log13 [avg_exec_time_ns="0" ];
> log13 -> alert14 [processed="0"];
>
> alert14 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" 
> infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
> }
> ```
>
> My questions are:
> 1) Am I right that this is failing at the join? Or are there 
> fundamentally bigger problems?
> 2) What have I done wrong for this join to be failing? Am I completely 
> misunderstanding the join (or something even more general), or is there 
> just a small implementation issue?
> 3) In order to use the result of the join, am I wrong to name it with a 
> var for reuse below? I thought .streamName('..') might do this without 
> setting a var, but I simply get an error that 'process_with_uptime' isn't 
> something that's in scope.
> 4) Is my overall approach just fundamentally wrong? 
> 5) Apart from using joins, what's the correct way to take the result of one 
> stream (or batch) and use it in another?
> 6) Should I have approached this some totally different way?
>
>
