I was struggling with gapped output in the weeks leading up to 2.5.0's
release. I tried a staggering array of (mostly useless) tweaks.
Looks like your Apache rules won't let me view that URL (403), so I'm just
gonna guess.
People have cited disk I/O as a bottleneck. I personally doubt this. If
it were true, you'd be seeing random gaps whenever RRD updates came thick
and fast (i.e. while all threads were updating RRDs at once), and the
failures should be at least a bit more distributed - not just in one cluster.
I really hope you aren't mixing Linux, FreeBSD and IRIX nodes *WITHIN* the
same cluster. They support different metric configurations which,
unfortunately, don't play well with one another, and depending on the
platform gmetad queries, you may get wildly different results (not to
mention funky heartbeats). Individual clusters should be as homogeneous as
possible, or you should check out metric.h and make sure all the members
are using only the metrics they all understand, and in that order.
Usually the "smoking gun" error message is in the lines above the
data_thread() error. But I don't remember whether my spuriously verbose
debug statements made it into 2.5.0 release, so you may want to add some of
your own (I could grep for "error" and get most of the useful output).
See if (and where) the data thread is marking the source as dead. Try to
find out if this is due to an RRD write problem. I don't know if my
retry-once RRD writing code made it into 2.5.0, but maybe you should add it
in if it didn't. That smoothed things out a bit for me.
Of course it didn't fix things entirely. Matt figured that one out -
adding a mutex around the actual librrd call. Doh!
So, I have no idea if any of the above helps but maybe it'll give you some
ideas... and anyway I felt obligated to write since that IRIX port is
mostly my fault. :P
Ryan Sweet wrote:
One additional bit: running gmetad with debug on yields lots of normal RRD
updates, followed by:
data_thread() couldn't parse the XML and data to RRD for [primary network]
where [primary network] is the data source I'm having trouble with....
regards,
-ryan
On Tue, 24 Sep 2002, Ryan Sweet wrote:
Sorry if this mail is more appropriate to the users list. I'll carry it
over there if that turns out to be the case. As it is, I'm trying to use
the vast improvement that is 2.5.0 to get some time from management to
work on ganglia.
I've got several small (<32 nodes) clusters of Linux systems, with FreeBSD
or Linux file servers that are working great with gmond.
I'm running gmetad on a Linux 2.4.19 system (RH 7.1+updates). It sees
the clusters just fine, and the web front end makes great graphs of them,
etc....
I'm also using gmond to monitor our workstation network (mix of IRIX,
FreeBSD, Linux), with the same gmetad collecting the data; herein lies the
problem.
With the old gmond (2.4.1) things mostly worked, though we often
had IRIX machines where gmond would just silently segfault and never be
heard from again. We also had a problem with machines (again, mostly the
IRIX ones) being marked as down from time to time when they (and their
gmond) were actually fine. Nevertheless, it was usable and mostly consistent.
With 2.5.0, gmond is much more stable, and it has stopped marking live
hosts as dead. However, on the workstation network (which happens to be the
same network as the gmetad server), the web frontend is showing graphs
that have large gaps in them. The values reported for "now" always
appear to be correct, but the values are graphed incorrectly.
For an example (not live, just a dump to html), see
http://wwwx.atos-group.nl/admn/gmetad_ex/gmetad.html
This particular graph is for a Linux system, but it is on the same
multi-cast channel as the IRIX systems....
So where should I begin to look? I suspect that it is actually a problem
with gmond, most likely on the IRIX systems, since gmetad and the web
front end are working great on the clusters.
regards,
-Ryan