the gaps in the gmetad graphs are caused by *UNKNOWN* data. let's walk
through this and figure out what is going on....
here is the relevant gmetad code...in ./gmetad/rrd_helpers.c RRD_create().
----------------- begin code snip -------------------------
/* Our heartbeat is twice the step interval which is always 15. */
heartbeat = 2*step;
argv[argc++] = "dummy";
argv[argc++] = rrd;
argv[argc++] = "--step";
sprintf(s, "%u", step);
argv[argc++] = s;
sprintf(sum,"DS:sum:GAUGE:%d:U:U", heartbeat);
argv[argc++] = sum;
if (summary) {
sprintf(num,"DS:num:GAUGE:%d:U:U", heartbeat);
argv[argc++] = num;
}
argv[argc++] = "RRA:AVERAGE:0.5:1:240";
argv[argc++] = "RRA:AVERAGE:0.5:24:240";
argv[argc++] = "RRA:AVERAGE:0.5:168:240";
argv[argc++] = "RRA:AVERAGE:0.5:672:240";
argv[argc++] = "RRA:AVERAGE:0.5:5760:370";
------------------ end code snip --------------------
for every RRDb the step is 15 seconds and the heartbeat is 30 seconds.
for non-summary databases we have one DS (data source) called "sum".
summary databases also have a second DS called "num" which holds the
number of hosts in the summation. both the "num" and "sum" DS have a 30
second heartbeat, and the min and max values are set to "U" (unknown),
meaning.. no limits are enforced on the values.
there are 5 RRAs (round-robin archives). each RRA uses the AVERAGE
consolidation function and has a 0.5 xff (The xfiles factor defines
what part of a consolidation interval may be made up from *UNKNOWN* data
while the consolidated value is still regarded as known). in short, if
more than half of the values over a consolidation interval are *UNKNOWN*
then the whole consolidated value is marked as *UNKNOWN*.
here is something else to think about... and i'll comment more
afterwards...
-------------------------------------------------------------------------
http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/manual/rrdcreate.html
-------------------------------------------------------------------------
Here is an explanation by Don Baarda on the inner workings of rrdtool. It
may help you to sort out why all this *UNKNOWN* data is popping up in your
databases:
RRD gets fed samples at arbitrary times. From these it builds Primary Data
Points (PDPs) at exact times every ``step'' interval. The PDPs are then
accumulated into RRAs.
The ``heartbeat'' defines the maximum acceptable interval between samples.
If the interval between samples is less than ``heartbeat'', then an
average rate is calculated and applied for that interval. If the interval
between samples is longer than ``heartbeat'', then that entire interval is
considered ``unknown''. Note that there are other things that can make a
sample interval ``unknown'', such as the rate exceeding limits, or even an
``unknown'' input sample.
The known rates during a PDP's ``step'' interval are used to calculate an
average rate for that PDP. Also, if the total ``unknown'' time during the
``step'' interval exceeds the ``heartbeat'', the entire PDP is marked as
``unknown''. This means that a mixture of known and ``unknown'' sample
time in a single PDP ``step'' may or may not add up to enough ``unknown''
time to exceed ``heartbeat'' and hence mark the whole PDP ``unknown''. So
``heartbeat'' is not only the maximum acceptable interval between samples,
but also the maximum acceptable amount of ``unknown'' time per PDP
(obviously this is only significant if you have ``heartbeat'' less than
``step'').
The ``heartbeat'' can be short (unusual) or long (typical) relative to the
``step'' interval between PDPs. A short ``heartbeat'' means you require
multiple samples per PDP, and if you don't get them mark the PDP unknown.
A long heartbeat can span multiple ``steps'', which means it is acceptable
to have multiple PDPs calculated from a single sample. An extreme example
of this might be a ``step'' of 5mins and a ``heartbeat'' of one day, in
which case a single sample every day will result in all the PDPs for that
entire day period being set to the same average rate.
-- Don Baarda
----------------------------- end great info --------------------------
ok.. wow... let's try to simplify this..
first.. everything in rrdland is a simple timestamp/value pair. the
primary data points (PDPs) are "snapped" to the specified "step" interval
(even if it's not exact)...
for example...
00:00 insert value 5
00:20 insert value 10
00:35 insert value 7
00:45 insert value 9
00:60 insert value 10
here is what rrd returns... (i actually ran this using rrdtool, btw)...
00:00 5
00:15 9.333333333333333
00:30 8.4
00:45 8.066666666666666
00:60 9.866666666666666
sooo.... at 15 second intervals rrdtool interpolates... it knows at 00:00
the value is 5 and at 00:20 the value is 10.. on and on and on... it is
interpolating at each step along the way.
wow! that gives me a great idea of how to make gmetad MUCH less disk i/o
intensive... (have a HUGE heartbeat and use explicit *UNKNOWN* values for
dead data sources and only write significant CHANGES in value...later...
g3)... focus.. focus...
the heartbeat is currently set way too small. since it is only 2x the
step, if any data source takes longer than 30 seconds to collect, parse
and write (that'll happen!).. then that interval gets marked as *UNKNOWN*.
here is a test to see if we can reduce the gaps in your images...
1. (re)move your old RRDbs in /var/lib/ganglia/rrds
(i know.. that sux.. sorry)
2. change line 79 in ./gmetad/rrd_helpers.c from
/* Our heartbeat is twice the step interval. */
heartbeat = 2*step;
to be
/* Our heartbeat interval is eight times the step interval */
heartbeat = 8*step;
3. recompile gmetad and give it a try.
this new gmetad will likely have far fewer gaps. the only catch is
this: if a data source goes offline, you will not see the gap in the
graph (telling you the data source is dead) until eight steps (2 minutes)
have passed. i think that is a small price to pay.
i'm sorry that you haven't heard much from me lately... my time is being
consumed by writing ganglia 3 and doing talks (about ganglia). i don't
want to over-promise anything so mum's the word, but the current limitations
of gmetad (v2) will disappear in v3. i hope this small hack helps.
-matt
ps. have a great weekend guys!