Ian,

You're right. My terminology was quite mixed up. I was jumping back into
ganglia after a long time away and working from another admin's
explanations of the setup, which were slightly inaccurate.

I've solved the problem - thanks to you. What you said made me realize what
was wrong with our configs. Let me try a fresh new explanation... this one
may be useful for the archives.

- We have roughly 10 clusters in a grid.
- Within each cluster, all of the systems unicast their data to two specific
systems within the cluster.
- gmetad pulls from each one of these clusters using a data_source and both
'collector hosts' as arguments to the data_source. There are two collector
hosts for redundancy.
- In addition, all hosts also unicast their data to another 'cluster'
consisting of 'all hosts'. The gmond instance for it ran on the same box as
gmetad.
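
As a rough sketch, the original gmetad.conf looked something like the
following. The grid name, cluster names, hostnames, and port here are
placeholders I made up for illustration, not our real ones:

```
# gmetad.conf (sketch; all names below are hypothetical)
gridname "MyGrid"

# One data_source per cluster, listing both collector hosts.
# gmetad polls the first host and falls back to the second if it fails.
data_source "web" collector1.web.example.com:8649 collector2.web.example.com:8649
data_source "db" collector1.db.example.com:8649 collector2.db.example.com:8649

# The 'all hosts' cluster, served by the gmond on the gmetad box.
data_source "MyGrid" localhost:8649
```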

There were two problems:
 1. The 'all hosts' cluster had the same name as the 'grid' - this seems to
cause _great_ confusion. Renaming this cluster to something else caused the
grid view to immediately start working again. This became obvious to me once
I understood how ganglia was working.

 2. The 'all hosts' cluster gets completely overrun. gmond doesn't keep up
very well, and the data has gaps and is often significantly behind.

The solution to the first one was easy: remove the naming conflict. For the
second, I reconfigured the hosts to send their data to the gmond running on
the ganglia server only if they were not part of an existing cluster. I then
named this cluster 'other-hosts' and told gmetad to pull 'other-hosts' from
the localhost gmond.
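
For the archives, a minimal sketch of what that fix might look like,
assuming ganglia 3.x-style gmond.conf syntax and the default port 8649. The
server hostname is a placeholder, and a real gmond.conf will carry more
settings than shown here:

```
/* gmond.conf on a host that belongs to no other cluster (sketch) */
cluster {
  name = "other-hosts"
}
udp_send_channel {
  host = ganglia.example.com  /* hypothetical name of the ganglia server */
  port = 8649
}
```

```
# gmetad.conf on the ganglia server (sketch)
# Pull the catch-all cluster from the local gmond; its name no longer
# matches the gridname.
data_source "other-hosts" localhost:8649
```

The gmond on the ganglia server would also need a matching
udp_recv_channel and tcp_accept_channel on that port so it can receive the
unicast data and answer gmetad's polls.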

That fixes the problem I reported.

Sadly, I have one other, very small problem - but I'll send another email
for that.

Thanks for your help. Sorry for the very poor initial description.

-- 
Phil Dibowitz
P: 310-360-2330 C: 213-923-5115
Unix Admin, Ticketmaster.com

"Never write it in C if you can do it in 'awk';
 Never do it in 'awk' if 'sed' can handle it;
 Never use 'sed' when 'tr' can do the job;
 Never invoke 'tr' when 'cat' is sufficient;
 Avoid using 'cat' whenever possible" -- Taylor's Laws of Programming

