It's been a while since I had time to dedicate to this idea - but now I'm
partway through it.
Thanks Darren for the offer to look at what I was doing wrong querying the
RRDs. I think I've made some progress, and now get what I expect. [Well,
mostly.]
So, when I use the CLI RRD tool, fetch, it returns a header like this:
uptime loss median ping1 ping2 ... ping20
If I look at the matching data it returns, it appears that there's no header
for the first column.
This is the epoch time [seconds since some date long ago]. That's good.
Then uptime. I assume that is the second column/value. IME, it's always "null"
or NaN. That seems fine too, though I'm not sure why it's there - but oh, well.
Loss. I'd have thought this is "packet loss" or how many of the fpings [in my
case] were returned. But that doesn't seem to be the case.
Median is the median of something - I'd guess it's the "middle" value. [Not the
"average" but the actual "median" of the RTTs in this sample.] That seems fine.
And then the rest of the pings all seem reasonable.
So, of the columns, it's really the "loss" column I have problems with - it's
just not comprehensible.
If I look at a smokeping graph, going by the color values in the graph I can
get a rough idea how many packets were lost. [At least according to the graph.]
In one sample, over a four-minute period, the graph shows ~25% loss in the
first minute, ~50% the next, ~10% the next and 0% in the final minute.
[step=60s.]
However, if I fetch that data from the RRD [in full resolution] using something
like this:
rrdtool fetch /var/lib/smokeping/some.rrd AVERAGE -s -240
...I get a data table like this.
Epoch time: uptime  loss      median    ping1     ping2     ... ping20
---
1557359880: nan     3.966667  0.001019  0.001010  0.001010  ...
1557359940: nan     5.300000  0.001021  nan       nan
1557360000: nan     1.733333  0.001024  0.001000  0.001000
1557360060: nan     0.000000  0.001026  0.001000  0.001006
The second minute [1557359940] has four ping samples that return NaN - which I
assume are lost packets.
But that doesn't match the value in the "loss" column - it's 5.3.
And the graph showed ~50% loss - yet the actual samples show 4/20 [4 NaN and
16 samples with valid values], or 20%.
The first and third minutes [1557359880 & 1557360000] have millisecond values
for every ping, 1-20 - which seems, to me, to mean there was NO packet loss.
Yet the graph shows ~25% and ~10% respectively.
And even more confusing, yet again, is the loss column - showing 3.966667 and
1.733333 respectively.
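Just to spell out the arithmetic I'm doing by hand [a sketch only - I'm
assuming a lost ping shows up as nan in the fetched row, and I've filled the 16
answered slots with the median value purely for illustration]:

```python
import math

# The 1557359940 row from the table above: 4 lost pings, 16 answered.
# The 0.001021 values are just stand-ins (the row's median) for illustration.
pings = [math.nan] * 4 + [0.001021] * 16

lost = sum(1 for p in pings if math.isnan(p))
loss_pct = 100.0 * lost / len(pings)

print(lost)      # 4
print(loss_pct)  # 20.0
```

By that reckoning the row is 20% loss - not 5.3 of anything, and not the ~50%
the graph suggests.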
Can someone please explain what is really in that third smokeping column
[seemingly labelled "loss"] and how its value is calculated? And why do the
graphs, the loss column and the returned ping-value columns simply not agree
with each other?
I just really need to understand what's going on, because I don't want to write
a plugin that's going to return data/state incorrectly!
TIA
-Greg
So, I know querying the RRD isn't exactly a smokeping problem - but I think
it's an appropriate place to start.
I'm attempting to write a Nagios/OMD plugin.
Yes, there is a smokeping plug-in currently, but the problem I'm trying to
solve is this...
I've had cases where latency or packet loss goes up, consistently, and I'd like
to get alerts.
But I don't want alerts when a single sample gets, say, 3% loss, or latency
jumps 30%. But if I measured that over, say, 20 minutes, or an hour, or four
hours - well then I could set limits a lot tighter than I would for a single
sample.
For example, if packet loss is greater than 2% for an hour - well we've
probably got a problem. Same with latency. It might go up for someone's
upload/download - but if it climbs 40% for four hours, then it's a problem we
ought to look at.
With the smokeping plugin or Nagios's TCP probe - you can really only look at
the result for a single sample [essentially], not an average.
Thus, you end up setting limits that are far outside of what might actually
constitute a problem, because you might have that happen for a few minutes -
perhaps a few times a day - and you don't want nagios [or smokeping] to alert
on all those instances. So, that means you inevitably miss events that are
important.
So, I want a smokeping plugin that you can set to average the last X
minutes/hours/whatever of loss/latency/jitter and generate warning/critical
events.
So, I need to query the RRD's and pull stats.
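The logic I have in mind is roughly this [a sketch only - the rows, the
pings-per-round count and the threshold are made-up examples standing in for
what a real fetch over the alert window would return]:

```python
# Hypothetical per-minute rows: (epoch, lost_ping_count, median_rtt_seconds),
# standing in for fetched RRD data over the alerting window.
rows = [
    (1557359880, 1, 0.00102),
    (1557359940, 4, 0.00102),
    (1557360000, 0, 0.00102),
    (1557360060, 0, 0.00103),
]

PINGS_PER_ROUND = 20   # samples per smokeping round in my setup
WARN_LOSS_PCT = 2.0    # made-up threshold: warn if average loss > 2%

# Average loss over the whole window, not per single sample.
total_lost = sum(lost for _, lost, _ in rows)
avg_loss_pct = 100.0 * total_lost / (len(rows) * PINGS_PER_ROUND)

state = "WARNING" if avg_loss_pct > WARN_LOSS_PCT else "OK"
print(f"{state}: avg loss {avg_loss_pct:.2f}% over {len(rows)} rounds")
```

A single bad minute barely moves the window average, but sustained loss trips
the threshold - which is exactly the behavior I'm after.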
Ok, now that I've got you this far [Thanks, by the way!] - here's the problem
I've got.
[I'm a terrible coder, I have a short attention span, I am even worse at perl,
and I hate details! So, be patient with me!]
Code snippet: [I stole this off the web somewhere, I don't recall where...]
---
#!/usr/bin/perl
use strict;
use warnings;

use lib qw( /usr/lib/arm-linux-gnueabihf/perl5/5.20 ../lib/perl );
use RRDs;
use POSIX qw(strftime);

# start_time is the oldest data point, and end_time is the newest.
my $cur_time   = time();           # current time
my $end_time   = $cur_time - 60;   # end time: 1m ago
my $start_time = $end_time - 600;  # start time: 10m in the past
my $rrd_res    = 60;

print "CT: $cur_time \n";
print strftime("%m/%d/%Y %H:%M:%S", localtime($cur_time)), "\n \n";
print "ET: $end_time \n";
print strftime("%m/%d/%Y %H:%M:%S", localtime($end_time)), "\n \n";
print "ST: $start_time \n";
print strftime("%m/%d/%Y %H:%M:%S", localtime($start_time)), "\n \n";

# fetch average values from the RRD database between start and end time
my ($start, $step, $ds_names, $data) =
    RRDs::fetch("/var/lib/smokeping/Some-CPE.rrd", "AVERAGE",
                "-r", $rrd_res, "-s", $start_time, "-e", $end_time);
die "RRDs::fetch failed: " . RRDs::error() . "\n" if RRDs::error();

# $ds_names says which column is which. Note the timestamp is NOT one of
# the returned columns; it's reconstructed below from $start and $step.
print "DS names: @$ds_names \n";

# save fetched values in a 2-dimensional array
my @vals;
my $rows          = 0;
my $time_variable = $start;

print "Start: $start : ",
      strftime("%m/%d/%Y %H:%M:%S", localtime($start)), "\n \n";
print "step: $step \n";
print "start loop \n";
print " --- \n";

foreach my $line (@$data) {
    print strftime("%m/%d/%Y %H:%M:%S", localtime($time_variable)), "\n";
    $vals[$rows][0] = $time_variable;   # column 0 holds the timestamp
    $time_variable += $step;

    my $columns = 1;                    # data-source values start at column 1
    foreach my $val (@$line) {
        print " --- \n";
        print "row: $rows - col: $columns \n";
        # undefined values are slots the RRD has no data for (NaN)
        printf "Val: %s ", defined $val ? $val : "nan";
        $vals[$rows][$columns] = $val;
        printf "VC: %s \n", defined $vals[$rows][$columns] ? $vals[$rows][$columns] : "nan";
        print " --- \n";
        $columns++;
    }
    $rows++;
}
---
I've put in a bunch of print statements so I can try to figure out what's going
on. [You can ignore all that...]
There are also some errors in the loop, because it parses more rows than exist
in the fetch - but ignore that too. [At least for now. Or you can tell me why,
if you like. I'm pretty sure I'll figure it out.]
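My best guess on the extra rows [purely an assumption on my part - I haven't
verified this against the rrdtool docs]: if the fetch returns one row per step
boundary from start to end inclusive, then a 600-second span at 60-second
resolution gives 11 rows, not the 10 I'd naively expect:

```python
start = 0          # stand-in for the step-aligned start time
end = start + 600  # my 10-minute window
step = 60          # my fetch resolution

# If both boundary rows are included [my assumption], that's one extra row.
row_count = (end - start) // step + 1
print(row_count)  # 11
```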
But what's interesting [at least right now] is that the first two columns have
issues.
Column one [or the first returned value from every row] appears to always be
null.
And the second always appears to be zero.
[At least in my case, with my RRDs.]
But I'm pretty sure it's the same with any RRD from smokeping.
I may not understand [almost certainly don't] what's going on, but I'd have
expected the values in the columns 3-23 to start at 1 and go through 20. [I do
20 samples in this RRD per row.]
So, can someone explain why the first value [column] is always null, and the
second is always zero? [These are all full resolution samples, no aggregation
has occurred.]
Thanks for anyone who takes a stab at it.
And if you're reading Tobi, I'd be glad for your input and/or thoughts.
Thanks!
-Greg
--
Gregory Sloop, Principal: Sloop Network & Computer Consulting
Voice: 503.251.0452 x82
EMail: gr...@sloop.net
http://www.sloop.net
---
_______________________________________________
smokeping-users mailing list
smokeping-users@lists.oetiker.ch
https://lists.oetiker.ch/cgi-bin/listinfo/smokeping-users