Thanks for a detailed question!
1) Am I right in that this is failing at the join? Or are there
fundamentally bigger problems?
Not quite. The join node has two parents, log4 and groupby8, and neither
has sent any points on to it, so the join has not had an opportunity to do
anything yet. If you follow the trail back up, the window6 node has not
emitted any values either, which means not enough data has arrived for it
to emit a window. The other window node (window2) got enough data to emit
once, but that was it.
It looks like you are windowing the data to get the grace period you were
talking about for new hosts. In that case you will want to configure the
alert node with `.all()` so that every point in the window has to meet the
condition in order to trigger an alert.
If you are not using the window for that purpose, just remove it, since
it's not doing anything otherwise.
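A minimal sketch of the `.all()` variant, assuming you keep the 10m/30s window from your script and only want a critical alert when every point in the window has a zero count. The tag names in the `groupBy` and the single `.crit` condition are illustrative assumptions, not something I have run against your data:

```
stream
    |from()
        .measurement('process_count')
        // Group first so each process on each host gets its own window.
        .groupBy('name', 'host')
    |window()
        .period(10m)
        .every(30s)
    |alert()
        // With .all(), the alert triggers only if every point in the
        // window matches the condition, not just one of them.
        .crit(lambda: "count" == 0)
        .all()
        .victorOps()
```

With this shape a single zero sample is not enough to alert; the count has to stay at zero for everything the window currently holds.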
2) What have I done wrong for this join to be failing? Am I completely
misunderstanding the join (or even more general), or is there just a small
implementation issue?
Understanding the join `.on` property is the key here. The `.on` property
expects one of the parents to be grouped by a more specific set of tags and
the other parent to be grouped by a less specific set (a subset of those
tags).
For example, in your case the process data should be grouped by name and
host, while the uptime data is grouped only by host. The resulting data is
grouped by the more specific set of tags (i.e. name and host). I'll show
an example below.
3) In order to use the result of the join, am I wrong to name it with a var
for reuse below? I thought .streamName('..') might do this without setting
a var, but I simply get an error that 'process_with_uptime' isn't something
that's in scope.
Using a var is the right way to reuse the join result. `.streamName` is
rarely needed; it only names the stream, which would become the name of the
measurement if you were to write it back to InfluxDB or something. Other
than that your eval looks correct.
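For illustration, a sketch of the one case where `.streamName` matters: writing the joined data back to InfluxDB. The database and retention policy are taken from your task output; whether you want to store the joined points at all is an assumption on my part:

```
var process_counts = stream
    |from()
        .measurement('process_count')
        .groupBy('name', 'host')

var box = stream
    |from()
        .measurement('system')
        .groupBy('host')

process_counts
    |join(box)
        .as('process', 'sys')
        .tolerance(15s)
        .on('host')
        // Names the joined stream; used as the measurement name on write.
        .streamName('process_with_uptime')
    |influxDBOut()
        .database('someapp_production')
        .retentionPolicy('default')
```

The name set by `.streamName` lives on the data, not in the script, which is why referring to `process_with_uptime` without a var gives a scope error.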
4) Is my overall approach just fundamentally wrong?
Nope
5) Apart from using joins, what's the correct way to take the result of 1
stream (or batch) and use it in another?
Join is the answer here.
6) Should I have approached this some totally different way?
Like you said, there are a few different ways to go about this. I think the
simplest is a slight modification of what you are doing. Instead of
windowing the data, simply join the raw uptime and process streams. Then
use an expression like the one you have for the critical alert to check the
uptime, which will naturally filter out new hosts.
var process_counts = stream
    |from()
        .measurement('process_count')
        // I am assuming that you want the name and fully_qualified_role tags
        // as well, since you referenced them below in the alert.
        .groupBy('name', 'fully_qualified_role', 'host')
    |log()

var box = stream
    |from()
        .measurement('system')
        // Only group by host here, since that is all the tag info we have.
        .groupBy('host')
    |log()

var process_with_uptime = process_counts
    |join(box)
        .as('process', 'sys')
        .tolerance(15s)
        .on('host')
    |log()
        .prefix('** JOIN')
    |eval(lambda: "process.time" - "sys.uptime")
        .as('boot_time')
        // Keep the joined fields as well; without .keep() only boot_time
        // is passed on and the alert cannot see process.count or sys.uptime.
        .keep()

process_with_uptime
    |log()
        .level('DEBUG')
        .prefix('** PROCESS WITH UPTIME')
    |alert()
        .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
        // Fields are prefixed by the join, so count is now process.count.
        .message('{{ index .Tags "name" }} has {{ index .Fields "process.count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{ index .Fields "boot_time" }}.')
        .info(lambda: "process.count" >= 0)
        .warn(lambda: "process.count" == 0)
        // possible alternative
        .crit(lambda: ("process.count" == 0) AND ("sys.uptime" > 120))
        .victorOps()
On Friday, February 10, 2017 at 3:29:56 AM UTC-7, Glenn Davy wrote:
>
> Hi Peeps
>
> I'm trying to learn to use Kapacitor and hitting a few snags in my
> understanding, trying to solve this simple problem has surfaced all sorts
> of questions, and I'm hoping to get some of my misunderstandings sorted out.
>
> I've got a measurement called process_count that shows a count of the
> number of a given process running by host, and there's a 'system' table
> which comes from Telegraf and is essentially the output of `uptime`.
>
> If that process stops running (process_count goes to 0), I want to be
> alerted. But when a new box comes up, I want to allow some breathing space
> before we get alerts.
>
> There are obviously a few ways to solve this (I've even tried some!) and
> I'm keen to learn better ways, but I'm running with this as a sample for
> asking questions.
>
> Samples are sent to Influx at about 30 second intervals (+/- jitter).
>
> I'm trying to join process_count records onto the system record (1 system
> record for many process_count records), so that there's an uptime field
> available when I determine my critical alert.
>
>
> Here's a sample from my process_count table and from my system table:
>
> ```
> > select count, host, name from process_count where instance_id='i-0xxxxx3e078a04f20' group by instance_id order by time desc limit 10;
> name: process_count
> tags: instance_id=i-0xxxxx3e078a04f20
> time                            count  host                                        name
> ----                            -----  ----                                        ----
> 2017-02-10T03:25:56.004751872Z  1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
> 2017-02-10T03:25:55.984448256Z  6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
> 2017-02-10T03:25:25.92088576Z   1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
> 2017-02-10T03:25:25.900282368Z  6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
> 2017-02-10T03:24:55.834618368Z  1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
> 2017-02-10T03:24:55.814406144Z  6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
> 2017-02-10T03:24:25.751718144Z  1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
> 2017-02-10T03:24:25.7313984Z    6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
> 2017-02-10T03:23:55.66639104Z   1      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Master Processes
> 2017-02-10T03:23:55.64570112Z   6      someapp-production-web-i-0xxxxx3e078a04f20  Unicorn Worker Processes
>
> > select uptime, host from system where host='someapp-production-web-i-0xxxxx3e078a04f20' order by time desc limit 5;
> name: system
> time                  uptime  host
> ----                  ------  ----
> 2017-02-10T03:26:04Z  55399   someapp-production-web-i-0xxxxx3e078a04f20
> 2017-02-10T03:25:30Z  55365   someapp-production-web-i-0xxxxx3e078a04f20
> 2017-02-10T03:25:01Z  55336   someapp-production-web-i-0xxxxx3e078a04f20
> 2017-02-10T03:24:35Z  55309   someapp-production-web-i-0xxxxx3e078a04f20
> 2017-02-10T03:24:01Z  55276   someapp-production-web-i-0xxxxx3e078a04f20
> ```
>
> And here's the TICKscript. The problem I seem to be having is that nothing
> is coming out of the join. I'm not getting any logging out of the .log on
> the join or the subsequent stream. I'm hoping that the process_counts zip
> to the nearest time (I have a tolerance of 14s) based on host. Also, the
> annotations in the DOT script seem to suggest nothing is processed through
> these streams.
>
>
> ```
> ID: someapp_production_process_not_running
> Error:
> Template:
> Type: stream
> Status: enabled
> Executing: true
> Created: 09 Feb 17 12:20 UTC
> Modified: 10 Feb 17 03:35 UTC
> LastEnabled: 10 Feb 17 03:35 UTC
> Databases Retention Policies: ["someapp_production"."default"]
> TICKscript:
>
>
> var process_counts = stream
>     |from()
>         .measurement('process_count')
>     |window()
>         .period(10m)
>         .every(30s)
>     |groupBy('host')
>     |log()
>
> var box = stream
>     |from()
>         .measurement('system')
>     |window()
>         .period(10m)
>         .every(30s)
>     |log()
>     |groupBy('host')
>
> var process_with_uptime = process_counts
>     |join(box)
>         .as('process', 'sys')
>         .tolerance(14s)
>         .streamName('process_with_uptime')
>         .on('host')
>     |log()
>         .prefix('** JOIN')
>     |eval(lambda: "process.time" - "sys.uptime")
>         .as('boot_time')
>
> process_with_uptime
>     |log()
>         .level('DEBUG')
>         .prefix('** PROCSESS WITH UPTIME')
>     |alert()
>         .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
>         .message('{{ index .Tags "name" }} has {{index .Fields "count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{index .Fields "boot_time"}}.')
>         .info(lambda: "count" >= 0)
>         .warn(lambda: "count" == 0)
>         .crit(lambda: ("count" == 0) AND ("sys.uptime" < 120)) // possible alternative
>         .victorOps()
>
> DOT:
> digraph someapp_production_process_not_running {
> graph [throughput="0.00 points/s"];
>
> stream0 [avg_exec_time_ns="0" ];
> stream0 -> from5 [processed="23"];
> stream0 -> from1 [processed="23"];
>
> from5 [avg_exec_time_ns="157ns" ];
> from5 -> window6 [processed="6"];
>
> window6 [avg_exec_time_ns="565ns" ];
> window6 -> log7 [processed="0"];
>
> log7 [avg_exec_time_ns="0" ];
> log7 -> groupby8 [processed="0"];
>
> groupby8 [avg_exec_time_ns="0" ];
> groupby8 -> join10 [processed="0"];
>
> from1 [avg_exec_time_ns="452ns" ];
> from1 -> window2 [processed="17"];
>
> window2 [avg_exec_time_ns="1.05µs" ];
> window2 -> groupby3 [processed="1"];
>
> groupby3 [avg_exec_time_ns="0" ];
> groupby3 -> log4 [processed="0"];
>
> log4 [avg_exec_time_ns="0" ];
> log4 -> join10 [processed="0"];
>
> join10 [avg_exec_time_ns="0" ];
> join10 -> log11 [processed="0"];
>
> log11 [avg_exec_time_ns="0" ];
> log11 -> eval12 [processed="0"];
>
> eval12 [avg_exec_time_ns="0" eval_errors="0" ];
> eval12 -> log13 [processed="0"];
>
> log13 [avg_exec_time_ns="0" ];
> log13 -> alert14 [processed="0"];
>
> alert14 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0"
> infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
> }
> ```
>
> My questions are:
> 1) Am I right in that this is failing at the join? Or are there
> fundamentally bigger problems?
> 2) What have I done wrong for this join to be failing? Am I completely
> misunderstanding the join (or even more general), or is there just a small
> implementation issue?
> 3) In order to use the result of the join, am I wrong to name it with a
> var for reuse below? I thought .streamName('..') might do this without
> setting a var, but I simply get an error that 'process_with_uptime' isn't
> something that's in scope.
> 4) Is my overall approach just fundamentally wrong?
> 5) Apart from using joins, what's the correct way to take the result of 1
> stream (or batch) and use it in another?
> 6) Should I have approached this some totally different way?
>
>