I was struggling with gapped output in the weeks leading up to 2.5.0's release. I tried a staggering array of (mostly useless) tweaks.

Looks like your Apache rules won't let me view that URL (403), so I'm just gonna guess.

People have cited disk I/O as a bottleneck. I personally doubt this. If it were true, you'd be seeing random gaps whenever RRD updates came thick and fast (i.e. while all threads were updating RRDs at once), and the failures should be at least a bit more distributed - not just in one cluster.

I really hope you aren't mixing Linux, FreeBSD and IRIX nodes *WITHIN* the same cluster. They support different metric configurations which, unfortunately, don't play well with one another, and depending on which platform gmetad queries, you may get wildly different results (not to mention funky heartbeats). Individual clusters should be as homogeneous as possible; otherwise, check out metric.h and make sure all the members are using only the metrics they all understand, and in that order.

Usually the "smoking gun" error message is in the lines above the data_thread() error. But I don't remember whether my spuriously verbose debug statements made it into the 2.5.0 release, so you may want to add some of your own (grepping the debug output for "error" got me most of the useful lines).

See if (and where) the data thread is marking the source as dead. Try to find out if this is due to an RRD write problem. I don't know if my retry-once RRD writing code made it into 2.5.0, but maybe you should add it in if it didn't. That smoothed things out a bit for me.
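In case the retry didn't make it into 2.5.0, the pattern is easy to bolt on by hand. Here's a minimal sketch of what I mean; the names (write_rrd_retry, the function-pointer indirection) are mine for illustration, not the actual gmetad internals, and in the real code the write function would be the wrapper around librrd's rrd_update():

```c
#include <stdio.h>

/* Signature for whatever does the actual RRD write; in gmetad this
   would be the code path that ends in librrd's rrd_update().
   Returns 0 on success, nonzero on failure. */
typedef int (*rrd_write_fn)(const char *file, const char *args);

/* Retry-once wrapper: RRD writes can fail transiently when updates
   come thick and fast, so on failure try exactly one more time
   before giving up and letting the caller mark the source dead. */
static int write_rrd_retry(rrd_write_fn write, const char *file,
                           const char *args)
{
    int rc = write(file, args);
    if (rc != 0) {
        fprintf(stderr, "RRD write failed for %s, retrying once\n", file);
        rc = write(file, args);
    }
    return rc;
}
```

That won't cure a persistent failure (nor should it), but it smoothed out the one-off hiccups for me.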

Of course it didn't fix things entirely. Matt figured that one out - adding a mutex around the actual librrd call. Doh!

So, I have no idea if any of the above helps but maybe it'll give you some ideas... and anyway I felt obligated to write since that IRIX port is mostly my fault. :P

Ryan Sweet wrote:
One additional bit: running gmetad with debug on yields lots of normal rrd updates, followed by:

data_thread() couldn't parse the XML and data to RRD for [primary network]

where [primary network] is the data source I'm having trouble with....

regards,
-ryan

On Tue, 24 Sep 2002, Ryan Sweet wrote:


Sorry if this mail is more appropriate to the users list. I'll carry it over there if that turns out to be the case. As it is, I'm trying to use the vast improvement that is 2.5.0 to get some time from management to work on ganglia.

I've got several small (<32 nodes) clusters of Linux systems, with FreeBSD or Linux file servers, that are working great with gmond.

I'm running gmetad on a Linux 2.4.19 system (RH 7.1+updates). It sees the clusters just fine, and the web front end makes great graphs of them, etc....

I'm also using gmond to monitor our workstation network (mix of IRIX,
FreeBSD, Linux), with the same gmetad collecting the data; herein lies the
problem. With the old gmond (2.4.1) things mostly worked, though we often had IRIX machines where gmond would just silently segfault and never be heard from. We also had a problem with machines (also mostly the IRIX) being marked as down from time to time when they (and their gmond) were actually fine, nevertheless, it was usable, and mostly consistent.

With 2.5.0, gmond is much more stable, and it has stopped marking live hosts as dead. However, on the workstation network (which happens to be the same network as the gmetad server), the web frontend is showing graphs that have large gaps in them. The values reported for "now" always appear to be correct, but the values are graphed incorrectly.

For an example (not live, just a dump to html), see http://wwwx.atos-group.nl/admn/gmetad_ex/gmetad.html

This particular graph is for a Linux system, but it is on the same multicast channel as the IRIX systems....

So where should I begin to look? I suspect that it is actually a problem with gmond, most likely on the IRIX systems, since gmetad and the web front end are working great on the clusters.
regards,
-Ryan