On Friday, February 10, 2017 at 6:11:00 PM UTC-7, Glenn Davy wrote:
>
> On Saturday, 11 February 2017 07:05:14 UTC+13, [email protected] wrote:
>
>> Thanks for a detailed question!
>
> Welcome! Thanks for a detailed answer :)
>
>> Not quite, the join node has two parent nodes log4 and groupBy8. Neither parent has sent any points on to the join node, so the join node has not had an opportunity to do anything yet. If you follow the trail back up, the window6 node has not emitted any values either. Meaning that not enough data has arrived for it to trigger emitting a window. The other window node did get enough data to trigger one emit but that was it.
>
> I don't really understand this, in as much as, what's enough data to trigger an emit?
>
If the period of a window is, say, 1m, then one minute's worth of data has to arrive in order for the window to emit. That could be as little as two points more than 1m apart, and the window would just contain the first point.
>
>> Looks like you are windowing the data so that you can have the grace period you were talking about for new hosts. In that case you will want to configure the alert node with `.all()` so that all points in the window have to meet the conditions in order to trigger an alert.
>> If you are not using the window for that purpose then just remove it, as it's not doing anything otherwise.
>
> Nope, that wasn't the purpose, it was really just to give me the illusion of understanding what was happening :D
>
> So, then what is the purpose of the window? Is it just a way of saying confine your processing to what's in this group? So that for example, if I'd have done a first()/last()/sum()/count()/max()/min()/other() it would have only applied to what was in the window? Or does it have some other use?
>
Yes, a window defines how you want to batch up your data to perform aggregations, transformations and selections on the data.
>
>>> 2) What have I done wrong for this join to be failing? Am I completely misunderstanding the join (or even more general), or is there just a small implementation issue?
>>
>> Understanding the join `.on` property here is the key. The way the `.on` property works is it expects one of the parents to be grouped by a set of specific tags and one of the other parents to be grouped by less specific tags.
>> For example in your case the process data should be grouped by name and host while the uptime data is only grouped by host. The resulting data is grouped by the more specific set of tags (i.e. name and host). I'll show an example below.
>
> OK, great thanks! That makes sense, and seems to work now!
>
>> Other than that your eval looks correct.
>
> The eval gives me this error in the logs:
>
> eval9] 2017/02/11 00:23:45 E! no field or tag exists for process.time
>
> When I look at the data sent to victor I see this snippet listing the columns:
>
> ["time","process.count","sys.load1","sys.load15","sys.load5","sys.n_cpus","sys.n_users","sys.uptime","sys.uptime_format"]
>
> If I change from "process.time" to "time" (which seems to be the correct thing to do) I get:
>
> E! invalid math operator - for type time
>
> I'm guessing this is because I see these values associated with the above columns when I peek into the victor message:
>
> [["2017-02-11T00:51:00Z",0,0.05,0.05,0.09,2,1,201," 0:03"]]}]
>
> I'm guessing the time maths is choking on that? What's the handling so that times are processable inside Kapacitor, but get sent out in a readable format?
>
What processing are you trying to do on the time value? Some time math is currently implemented but not all. See https://github.com/influxdata/kapacitor/issues/169
>
> All that aside, I've remarked out the eval for now as it seemed to stop data flowing through, but apart from this eval, everything is now working as hoped!(tm).
>
Glad it's working!
>
> Thanks for all your explanations Nathan.
>
>> var process_counts = stream
>>     |from()
>>         .measurement('process_count')
>>         // I am assuming that you want tag name and fully_qualified_role as well since you referenced it below in the alert.
>>         .groupBy('name', 'fully_qualified_role', 'host')
>>     |log()
>>
>> var box = stream
>>     |from()
>>         .measurement('system')
>>         // Only group by host here, since that is all the tag info we have.
>>         .groupBy('host')
>>     |log()
>>
>> var process_with_uptime = process_counts
>>     |join(box)
>>         .as('process', 'sys')
>>         .tolerance(15s)
>>         .on('host')
>>     |log()
>>         .prefix('** JOIN')
>>     |eval(lambda: "process.time" - "sys.uptime")
>>         .as('boot_time')
>>
>> process_with_uptime
>>     |log()
>>         .level('DEBUG')
>>         .prefix('** PROCSESS WITH UPTIME')
>>     |alert()
>>         .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
>>         // After the join, fields carry the 'process.' and 'sys.' prefixes.
>>         .message('{{ index .Tags "name" }} has {{index .Fields "process.count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{index .Fields "boot_time"}}.')
>>         .info(lambda: "process.count" >= 0)
>>         .warn(lambda: "process.count" == 0)
>>         .crit(lambda: ("process.count" == 0) AND ("sys.uptime" > 120)) // possible alternative
>>         .victorOps()
>>
>>
>> On Friday, February 10, 2017 at 3:29:56 AM UTC-7, Glenn Davy wrote:
>>>
>>> Hi Peeps
>>>
>>> I'm trying to learn to use Kapacitor and hitting a few snags in my understanding. Trying to solve this simple problem has surfaced all sorts of questions, and I'm hoping to get some of my misunderstandings sorted out.
>>>
>>> I've got a measurement called process_count that shows a count of the number of a given process running by host, and there's a 'system' table which comes from Telegraf and is essentially the output of `uptime`.
>>>
>>> If that process stops running (process_count goes to 0), I want to be alerted. But when a new box comes up, I want to allow some breathing space before we get alerts.
>>>
>>> There are obviously a few ways to solve this (I've even tried some!) and I'm keen to learn better ways, but I'm running with this as a sample for asking questions.
>>>
>>> Samples are sent to influx at about 30 second intervals (+/- jitter).
>>>
>>> I'm trying to join process_count records onto the system record (1 system record for many process_count records), so that there's an uptime field available when I determine my critical alert.
>>>
>>> Here's a sample from my process_count table and from my system table:
>>>
>>> ```
>>> > select count, host, name from process_count where instance_id='i-0xxxxx3e078a04f20' group by instance_id order by time desc limit 10;
>>> name: process_count
>>> tags: instance_id=i-0xxxxx3e078a04f20
>>> time                           count host                                       name
>>> ----                           ----- ----                                       ----
>>> 2017-02-10T03:25:56.004751872Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:25:55.984448256Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>> 2017-02-10T03:25:25.92088576Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:25:25.900282368Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>> 2017-02-10T03:24:55.834618368Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:24:55.814406144Z 6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>> 2017-02-10T03:24:25.751718144Z 1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:24:25.7313984Z   6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>> 2017-02-10T03:23:55.66639104Z  1     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Master Processes
>>> 2017-02-10T03:23:55.64570112Z  6     someapp-production-web-i-0xxxxx3e078a04f20 Unicorn Worker Processes
>>>
>>> > select uptime, host from system where host='someapp-production-web-i-0xxxxx3e078a04f20' order by time desc limit 5 ;
>>> name: system
>>> time                 uptime host
>>> ----                 ------ ----
>>> 2017-02-10T03:26:04Z 55399  someapp-production-web-i-0xxxxx3e078a04f20
>>> 2017-02-10T03:25:30Z 55365  someapp-production-web-i-0xxxxx3e078a04f20
>>> 2017-02-10T03:25:01Z 55336  someapp-production-web-i-0xxxxx3e078a04f20
>>> 2017-02-10T03:24:35Z 55309  someapp-production-web-i-0xxxxx3e078a04f20
>>> 2017-02-10T03:24:01Z 55276  someapp-production-web-i-0xxxxx3e078a04f20
>>> ```
>>>
>>> And here's the TICKscript. The problem I seem to be having is that nothing is coming out from the join. I'm not getting any logging out of the .log on the join or the subsequent stream. I'm hoping that the process_counts zip to the nearest time (I have a tolerance of 14s) based on host. Also the annotations in the DOT script seem to suggest nothing is processed through these streams.
>>>
>>> ```
>>> ID: someapp_production_process_not_running
>>> Error:
>>> Template:
>>> Type: stream
>>> Status: enabled
>>> Executing: true
>>> Created: 09 Feb 17 12:20 UTC
>>> Modified: 10 Feb 17 03:35 UTC
>>> LastEnabled: 10 Feb 17 03:35 UTC
>>> Databases Retention Policies: ["someapp_production"."default"]
>>> TICKscript:
>>>
>>> var process_counts = stream
>>>     |from()
>>>         .measurement('process_count')
>>>     |window()
>>>         .period(10m)
>>>         .every(30s)
>>>     |groupBy('host')
>>>     |log()
>>>
>>> var box = stream
>>>     |from()
>>>         .measurement('system')
>>>     |window()
>>>         .period(10m)
>>>         .every(30s)
>>>     |log()
>>>     |groupBy('host')
>>>
>>> var process_with_uptime = process_counts
>>>     |join(box)
>>>         .as('process', 'sys')
>>>         .tolerance(14s)
>>>         .streamName('process_with_uptime')
>>>         .on('host')
>>>     |log()
>>>         .prefix('** JOIN')
>>>     |eval(lambda: "process.time" - "sys.uptime")
>>>         .as('boot_time')
>>>
>>> process_with_uptime
>>>     |log()
>>>         .level('DEBUG')
>>>         .prefix('** PROCSESS WITH UPTIME')
>>>     |alert()
>>>         .id('{{ .TaskName }}/{{ index .Tags "fully_qualified_role" }}/{{ index .Tags "host" }}')
>>>         .message('{{ index .Tags "name" }} has {{index .Fields "count" }} processes running for {{ .ID }}. System has been up for {{ index .Fields "sys.uptime" }} seconds and booted at {{index .Fields "boot_time"}}.')
>>>         .info(lambda: "count" >= 0)
>>>         .warn(lambda: "count" == 0)
>>>         .crit(lambda: ("count" == 0) AND ("sys.uptime" < 120)) // possible alternative
>>>         .victorOps()
>>>
>>> DOT:
>>> digraph someapp_production_process_not_running {
>>> graph [throughput="0.00 points/s"];
>>>
>>> stream0 [avg_exec_time_ns="0" ];
>>> stream0 -> from5 [processed="23"];
>>> stream0 -> from1 [processed="23"];
>>>
>>> from5 [avg_exec_time_ns="157ns" ];
>>> from5 -> window6 [processed="6"];
>>>
>>> window6 [avg_exec_time_ns="565ns" ];
>>> window6 -> log7 [processed="0"];
>>>
>>> log7 [avg_exec_time_ns="0" ];
>>> log7 -> groupby8 [processed="0"];
>>>
>>> groupby8 [avg_exec_time_ns="0" ];
>>> groupby8 -> join10 [processed="0"];
>>>
>>> from1 [avg_exec_time_ns="452ns" ];
>>> from1 -> window2 [processed="17"];
>>>
>>> window2 [avg_exec_time_ns="1.05µs" ];
>>> window2 -> groupby3 [processed="1"];
>>>
>>> groupby3 [avg_exec_time_ns="0" ];
>>> groupby3 -> log4 [processed="0"];
>>>
>>> log4 [avg_exec_time_ns="0" ];
>>> log4 -> join10 [processed="0"];
>>>
>>> join10 [avg_exec_time_ns="0" ];
>>> join10 -> log11 [processed="0"];
>>>
>>> log11 [avg_exec_time_ns="0" ];
>>> log11 -> eval12 [processed="0"];
>>>
>>> eval12 [avg_exec_time_ns="0" eval_errors="0" ];
>>> eval12 -> log13 [processed="0"];
>>>
>>> log13 [avg_exec_time_ns="0" ];
>>> log13 -> alert14 [processed="0"];
>>>
>>> alert14 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
>>> }
>>> ```
>>>
>>> My questions are:
>>> 1) Am I right in that this is failing at the join? Or are there fundamentally bigger problems?
>>> 2) What have I done wrong for this join to be failing? Am I completely misunderstanding the join (or even more general), or is there just a small implementation issue?
>>> 3) In order to use the result of the join, am I wrong to name it with a var for reuse below? I thought .streamName('..') might do this without setting a var, but I simply get an error that 'process_with_uptime' isn't something that's in scope.
>>> 4) Is my overall approach just fundamentally wrong?
>>> 5) Apart from using joins, what's the correct way to take the result of 1 stream (or batch) and use it in another?
>>> 6) Should I have approached this some totally different way?
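
For reference, the boot-time arithmetic that the eval in this thread is reaching for (sample time minus `sys.uptime`) and the grace-period condition from the reply's `.crit` lambda can be sketched outside Kapacitor. This is only an illustration in Python, not something Kapacitor runs: the function names `boot_time` and `should_crit` are made up for this example, the timestamp format is assumed from the `system` rows in the original post, and the 120-second default is taken from the lambda in the reply.

```python
from datetime import datetime, timedelta, timezone


def boot_time(sample_time: str, uptime_seconds: int) -> datetime:
    """Estimate when a host booted from one `system` sample (hypothetical helper)."""
    # Assumption: second-precision UTC timestamps, as in the `system` rows above.
    sampled_at = datetime.strptime(sample_time, "%Y-%m-%dT%H:%M:%SZ")
    sampled_at = sampled_at.replace(tzinfo=timezone.utc)
    return sampled_at - timedelta(seconds=uptime_seconds)


def should_crit(process_count: int, uptime_seconds: int, grace_seconds: int = 120) -> bool:
    """Mirror the reply's crit condition: no processes AND host past its grace period."""
    return process_count == 0 and uptime_seconds > grace_seconds


# Newest `system` row from the sample data in the original post.
print(boot_time("2017-02-10T03:26:04Z", 55399).isoformat())
print(should_crit(0, 60))     # freshly booted host, still inside the grace period
print(should_crit(0, 55399))  # long-running host with no processes
```

Note that the original post's lambda used `"sys.uptime" < 120`, which would only be true during the grace period, while the reply's version uses `> 120`, which is what the sketch mirrors.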
