It's been a while since I had time to dedicate to this idea - but now I'm part 
way through it.
Thanks, Darren, for the offer to look at what I was doing wrong querying the 
RRDs. I think I've made some progress, and I get what I expect now. [Well, 

So, when I use the CLI tool [rrdtool fetch], it returns a header like this:
uptime loss median ping1 ping2 .. ping20

If I look at the matching data it returns, it appears that there's no header 
for the first column.

This is the epoch time [seconds since some date long ago]. That's good.
Then uptime. I assume that is the second column/value. IME, it's always "null" 
or NaN. That seems good too, though I'm not sure why it's there - but oh, well.

Loss. I'd have thought this is "packet loss", or how many of the fpings [in my 
case] were returned. But that doesn't seem to be the case.
Median is the median of something - I'd guess it's the "middle" value [not the 
"average", but the actual "median"] of the RTTs in this sample. That seems fine.
And then the rest of the pings all seem reasonable.

So, of all the columns, the one I really have problems with is the "loss" 
column - it's just not comprehensible.

If I look at a smokeping graph, I can get a rough idea of how many packets were 
lost from the color values in the graph. [At least according to the graph.]
In one sample, over a four-minute period, I see the graph showing ~25% loss in 
the first minute, ~50% in the next, ~10% in the next and 0% in the final minute. 
[step=60s.]

However, if I fetch that data from the RRD [in full resolution] using something 
like this:

rrdtool fetch /var/lib/smokeping/some.rrd AVERAGE -s -240 
...I get a data table like this.

Epoch time    uptime  loss      median    ping1     ping2 ... ping20
1557359880:   nan     3.966667  0.001019  0.001010  0.001010 ...
1557359940:   nan     5.300000  0.001021  nan
1557360000:   nan     1.733333  0.001024  0.001000
1557360060:   nan     0.000000  0.001026  0.001000

The second minute [1557359940] has four ping samples that return NaN - which I 
assume are lost packets.
But that doesn't match the value in the "loss" column - it's 5.3.
And the graph showed ~50% loss - yet the actual samples show 4/20 [4 NaN, and 
16 samples with valid values] or 20%.

The first and third minutes [1557359880 & 1557360000] have millisecond values 
for every ping, 1-20 - which seems, to me, to mean there was NO packet loss.
Yet the graph shows ~25% and ~10% respectively.
And even more confusing, yet again, is the loss column - showing 3.966667 and 
1.733333 respectively.

Can someone please explain what is really in that third smokeping column 
[seemingly labelled "loss"] and how its calculation is done? And why do the 
graphs, the loss column and the returned ping value columns simply not agree 
with each other?
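
For reference, here is the arithmetic being compared, as a small Perl sketch. It assumes [and this is only an assumption, not something confirmed here] that the "loss" column is a count of lost pings out of the 20 sent per step, not a percentage, and that the fractional values come from RRD averaging updates onto step boundaries. The sub name loss_percent is made up for illustration; the loss values are copied from the fetch output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: "loss" is a count of lost pings per step, not a percentage.
# Convert a loss count into a percentage of the pings sent per step.
sub loss_percent {
    my ($loss, $pings) = @_;
    return undef unless defined $loss && $pings;   # unknown sample
    return 100 * $loss / $pings;
}

# "loss" values copied from the fetch output above; 20 pings per step.
my @loss = (3.966667, 5.300000, 1.733333, 0.000000);
printf "%.1f%%\n", loss_percent($_, 20) for @loss;
```

Under that assumption the four rows work out to roughly 19.8%, 26.5%, 8.7% and 0.0%, which still doesn't line up with the ~50% the graph showed for the second minute.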

I just really need to understand what's going on, because I don't want to write 
a plugin that's going to return data/state incorrectly!


So, I know querying the RRD isn't exactly a smokeping problem - but I think 
it's an appropriate place to start.

I'm attempting to write a Nagios/OMD plugin.
Yes, there is a smokeping plug-in currently, but the problem I'm trying to 
solve is this...

I've had cases where latency or packet loss goes up, consistently, and I'd like 
to get alerts.
But I don't want alerts when a single sample gets, say 3% loss, or latency 
jumps 30%. But if I measured that over say, 20 minutes, or an hour, or four 
hours - well then I could set limits that would be a lot tighter than I would 
for a single sample.

For example, if packet loss is greater than 2% for an hour - well we've 
probably got a problem. Same with latency. It might go up for someone's 
upload/download - but if it climbs 40% for four hours, then it's a problem we 
ought to look at.

With the smokeping plugin or Nagios's TCP probe - you can really only look at 
the result for a single sample [essentially], not an average. 

Thus, you end up setting limits that are far outside of what might actually 
constitute a problem, because you might have that happen for a few minutes - 
perhaps a few times a day - and you don't want nagios [or smokeping] to alert 
on all those instances. So, that means you inevitably miss events that are 

So, I want a smokeping plug-in that can be set to average the last X 
minutes/hours/whatever of loss/latency/jitter and generate 
warning/critical events.

So, I need to query the RRDs and pull stats.
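
The averaging-and-threshold idea above could be sketched roughly like this. It's a minimal sketch, not a finished plugin: the sub names window_average and nagios_status, the sample data and the 2%/5% thresholds are all hypothetical, and it assumes unknown samples show up as undef and should simply be skipped. The exit codes are the usual Nagios plugin ones [0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN].

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Average the defined samples in a window; undef (unknown) samples are skipped.
# Returns undef when the window holds no usable samples.
sub window_average {
    my @known = grep { defined } @_;
    return undef unless @known;
    return sum(@known) / @known;
}

# Map an averaged value onto the usual Nagios plugin exit codes:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
sub nagios_status {
    my ($avg, $warn, $crit) = @_;
    return 3 unless defined $avg;
    return 2 if $avg >= $crit;
    return 1 if $avg >= $warn;
    return 0;
}

# Hypothetical window: 60 one-minute loss percentages, two of them unknown.
my @loss_pct = ((0) x 43, (10) x 15, undef, undef);
my $avg      = window_average(@loss_pct);
my $status   = nagios_status($avg, 2, 5);   # warn at 2%, critical at 5%
printf "avg loss %.2f%% -> status %d\n", $avg, $status;
```

A real plugin would end with "exit $status" and print a one-line summary for Nagios; the point here is only that the thresholds apply to the window average, not to any single sample.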

Ok, now that I've got you so far [Thanks by the way!] - here's the problem I've 
[I'm a terrible coder, I have a short attention span, I am even worse at perl, 
and I hate details! So, be patient with me!]

Code snippet: [I stole this off the web somewhere, I don't recall where...]
#!/usr/bin/perl -W

use lib qw( /usr/lib/arm-linux-gnueabihf/perl5/5.20 ../lib/perl );
use RRDs;
use POSIX qw(strftime);

#start_time is the oldest data-point, and end_time is the newest.
my $cur_time = time();                # set current time
my $end_time = $cur_time - 60;     # set end time to 1m ago
my $start_time = $end_time - 600; # set start 10m in the past
my $rrd_res = 60;
my $temp_var = "";

#$f_cur_time = ctime($cur_time);
#$f_end_time = ctime($end_time);
#$f_start_time = ctime($start_time);
#$f_end_time = ctime($end_time);

print "CT: $cur_time \n";
print strftime("%m/%d/%Y %H:%M:%S",localtime($cur_time));
print "\n \n";

print "ET: $end_time \n";
print strftime("%m/%d/%Y %H:%M:%S",localtime($end_time));
print "\n \n";

print "ST: $start_time \n";
print strftime("%m/%d/%Y %H:%M:%S",localtime($start_time));
print "\n \n";

# fetch average values from the RRD database between start and end time
my ($start,$step,$ds_names,$data) =
   RRDs::fetch("/var/lib/smokeping/Some-CPE.rrd", "AVERAGE",
               "-r", "$rrd_res", "-s", "$start_time", "-e", "$end_time");

# save fetched values in a 2-dimensional array
my $rows = 0;
my $columns = 0;
my $time_variable = $start;

print "Start: $start : ";
print strftime("%m/%d/%Y %H:%M:%S",localtime($start));
print "\n \n";
print "step: $step \n";

print "start loop \n";
print " --- \n";
foreach $line (@$data) {
 $vals[$rows][$columns] = $time_variable;
 $temp_var = $time_variable;
 print strftime("%m/%d/%Y %H:%M:%S",localtime($temp_var));
 print "\n";  

 $time_variable = $time_variable + $step;
 $temp_var = $time_variable;
 print strftime("%m/%d/%Y %H:%M:%S",localtime($temp_var));
 print "\n";  
 foreach $val (@$line) {
  print " --- \n";
  print "row: $rows - col: $columns \n";
  print "Val: $val ";
  $vals[$rows][++$columns] = $val;
  print "VC: $vals[$rows][$columns] \n";
  print " --- \n";
 }
 $rows++;       # move on to the next row
 $columns = 0;  # reset the column index for the next row
}


I've put in a bunch of print statements so I can try to figure out what's going 
on. [You can ignore all that...]
There are also some errors in the for loop, because it parses more rows than 
exist in the fetch - but ignore that too. [At least for now. Or you can tell me 
why - if you like. I'm pretty sure I'll figure it out.]

But what's interesting [at least right now] is the first two columns.
Column one [or the first returned value from every row] appears to always be 
null, and the second always appears to be zero.
[At least in my case, with my RRDs.]
But I'm pretty sure it's the same with any RRD from smokeping.

I may not understand [almost certainly don't] what's going on, but I'd have 
expected the values in the columns 3-23 to start at 1 and go through 20. [I do 
20 samples in this RRD per row.]

So, can someone explain why the first value [column] is always null, and the 
second is always zero? [These are all full resolution samples, no aggregation 
has occurred.]
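
One possible reading, sketched in Perl: RRDs::fetch hands back the data-source names separately from the rows, and the rows carry no timestamp column [the script above adds the time itself], so the first value in each row would be "uptime" and the second "loss". The sketch below pairs each value with its name. The sub name rows_by_name and the sample rows are hypothetical stand-ins for real fetch output, and it assumes unknown [NaN] values arrive as undef in Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn the four values RRDs::fetch returns into a list of hashes, one per row,
# keyed by data-source name, with the row timestamp computed from start + step.
sub rows_by_name {
    my ($start, $step, $names, $data) = @_;
    my @rows;
    my $t = $start;
    for my $line (@$data) {
        my %row = (time => $t);
        @row{@$names} = @$line;      # hash slice: name => value, in DS order
        push @rows, \%row;
        $t += $step;
    }
    return @rows;
}

# Hypothetical stand-in for RRDs::fetch output (only 2 of the 20 pings shown).
# ASSUMPTION: unknown (NaN) values arrive as undef in Perl.
my @names = qw(uptime loss median ping1 ping2);
my @data  = ([undef, 0, 0.001026, 0.001000, 0.001010],
             [undef, 0, 0.001024, 0.001000, 0.001010]);

for my $row (rows_by_name(1557360000, 60, \@names, \@data)) {
    printf "%d uptime=%s loss=%s median=%s\n", $row->{time},
        map { defined $row->{$_} ? $row->{$_} : 'unknown' } qw(uptime loss median);
}
```

If that reading is right, a null first value is just the always-unknown uptime, and a zero second value is a loss of zero for that step.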

Thanks to anyone who takes a stab at it.
And if you're reading Tobi, I'd be glad for your input and/or thoughts.


Gregory Sloop, Principal: Sloop Network & Computer Consulting
Voice: 503.251.0452 x82
smokeping-users mailing list
