I was struggling with gapped output in the weeks leading up to 2.5.0's
release. I tried a staggering array of (mostly useless) tweaks.
Looks like your Apache rules won't let me view that URL (403), so I'm just
gonna guess.
People have cited disk I/O as a bottleneck. I personally doubt this. If
it were true, you'd be seeing random gaps whenever RRD updates came thick
and fast (i.e. while all threads were updating RRDs at once), and the
failures should be at least a bit more distributed - not just in one cluster.
I really hope you aren't mixing Linux, FreeBSD and IRIX nodes *WITHIN* the
same cluster. They support different metric configurations which,
unfortunately, don't play well with one another, and depending on the
platform gmetad queries, you may get wildly different results (not to
mention funky heartbeats). Individual clusters should be as homogeneous as
possible, or you should check out metric.h and make sure all the members
are using only the metrics they all understand, and in that order.
Usually the "smoking gun" error message is in the lines above the
data_thread() error. But I don't remember whether my spuriously verbose
debug statements made it into 2.5.0 release, so you may want to add some of
your own (I could grep for "error" and get most of the useful output).
See if (and where) the data thread is marking the source as dead. Try to
find out if this is due to an RRD write problem. I don't know if my
retry-once RRD writing code made it into 2.5.0, but maybe you should add it
in if it didn't. That smoothed things out a bit for me.
Of course it didn't fix things entirely. Matt figured that one out -
adding a mutex around the actual librrd call. Doh!
So, I have no idea if any of the above helps but maybe it'll give you some
ideas... and anyway I felt obligated to write since that IRIX port is
mostly my fault. :P
Ryan Sweet wrote:
One additional bit: running gmetad with debug on yields lots of normal RRD
updates, followed by:
data_thread() couldn't parse the XML and data to RRD for [primary network]
where [primary network] is the data source I'm having trouble with....
regards,
-ryan
On Tue, 24 Sep 2002, Ryan Sweet wrote:
Sorry if this mail is more appropriate to the users list. I'll carry it
over there if that turns out to be the case. As it is, I'm trying to use
the vast improvement that is 2.5.0 to get some time from management to
work on ganglia.
I've got several small (<32 nodes) clusters of Linux systems, with FreeBSD
or Linux file servers that are working great with gmond.
I'm running gmetad on a Linux 2.4.19 system (RH 7.1+updates). It sees
the clusters just fine, and the web front end makes great graphs of them,
etc....
I'm also using gmond to monitor our workstation network (mix of IRIX,
FreeBSD, Linux), with the same gmetad collecting the data; herein lies the
problem.
With the old gmond (2.4.1) things mostly worked, though we often
had IRIX machines where gmond would just silently segfault and never be
heard from again. We also had a problem with machines (again, mostly the
IRIX ones) being marked as down from time to time when they (and their
gmond) were actually fine. Nevertheless, it was usable and mostly consistent.
With 2.5.0, gmond is much more stable, and it has stopped marking live
hosts as dead. However, on the workstation network (which happens to be the
same network as the gmetad server), the web frontend is showing graphs
that have large gaps in them. The values reported for "now" always
appear to be correct, but the values are graphed incorrectly.
For an example (not live, just a dump to html), see
http://wwwx.atos-group.nl/admn/gmetad_ex/gmetad.html
This particular graph is for a Linux system, but it is on the same
multi-cast channel as the IRIX systems....
So where should I begin to look? I suspect that it is actually a problem
with gmond, most likely on the IRIX systems, since gmetad and the web
front end are working great on the clusters.
regards,
-Ryan