I'm the developer of a performance monitoring tool called collectl (see 
http://collectl.sourceforge.net/), a fairly lightweight data collector 
capable of collecting most relevant performance metrics.  In addition to 
the basics that most tools collect, like cpu, memory, disk and network, 
it also collects such things as nfs, infiniband, lustre, buddyinfo and 
interrupts by cpu, just to name a few.  What I believe makes 
collectl different is that it logs data locally to disk, which gives it 
the ability to take samples in the 5-10 second range at under 0.2% cpu 
overhead, and to sample at even shorter intervals if you need to.  It 
can even do fractional intervals.  The point is that this is simply far 
too much data to ever try to 
monitor/display with ganglia, but I also believe these are the 
frequencies of data collection you need if you want to see what your 
system is really doing in diagnostic situations.  At the same time, if 
you want to get a good overview of what your cluster is doing as a 
whole, you need ganglia.

However, it doesn't make a lot of sense to have 2 tools collecting the 
same data, one doing it locally and the other doing it for the cluster.

Working with Evan and Ken, I've added a new capability to collectl to 
send its data (or even a subset if you prefer) to gmond in binary 
format and it seems to work just fine.  The magic here is you can tell 
collectl to take samples every 10 seconds (its default) and write them 
to a local file, but only send samples every minute (or whatever you 
choose) to gmond. That way you can get your coarser data for ganglia to 
display (and get a lot more than gmond normally collects) and still have 
fine-grained data for diagnostic purposes if/when you need it.
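
To make the two-rate idea concrete, here is a minimal sketch in Python.  
It is not collectl's code (collectl is written in Perl), and the names 
run_two_rate, log_local, send_remote and send_every are all made up for 
illustration.  It also assumes the forwarded sample is simply the most 
recent one rather than an average over the minute.

```python
from typing import Callable, Dict, Iterable

Sample = Dict[str, float]


def run_two_rate(
    samples: Iterable[Sample],
    log_local: Callable[[Sample], None],
    send_remote: Callable[[Sample], None],
    send_every: int = 6,
) -> None:
    """Log every sample locally, but forward only every Nth one.

    With a 10 second sampling interval, send_every=6 forwards one
    sample per minute while the local log keeps all of them.
    """
    if send_every < 1:
        raise ValueError("send_every must be at least 1")
    for count, sample in enumerate(samples, start=1):
        log_local(sample)            # fine-grained local record
        if count % send_every == 0:
            send_remote(sample)      # coarse feed for the aggregator
```

The two callbacks keep the sketch independent of where the data goes: 
log_local could append to a file and send_remote could write to a UDP 
socket.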

This then leads to my question: what is the best way to send data to 
ganglia?  I want to keep my messages very dense, so we chose to simply 
send out binary data in the same format gmond expects.  In the case of 
PNNL, where they have a monitoring hierarchy, we've completely replaced 
all the monitoring gmonds with a dozen that act only as aggregators.  
There are about 190 nodes running collectl sending UDP messages to each 
aggregator gmond and it seems to run just fine.  Does this make sense?  
Is there anything to watch out for?
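
For anyone unfamiliar with what "dense binary over UDP" looks like in 
practice, here is a rough Python sketch.  The string encoding follows the 
XDR rules (4-byte big-endian length, then the bytes padded to a multiple 
of 4), but the record layout in pack_metric is purely hypothetical and is 
NOT gmond's actual wire format; the function names are mine as well.

```python
import socket
import struct
from typing import Tuple


def xdr_string(text: str) -> bytes:
    """Encode a string XDR-style: length prefix plus zero padding to 4 bytes."""
    data = text.encode("utf-8")
    padding = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def pack_metric(host: str, name: str, value: float) -> bytes:
    """Pack one metric into a compact binary record.

    Hypothetical layout for illustration only: host, metric name,
    then the value as a big-endian double.
    """
    return xdr_string(host) + xdr_string(name) + struct.pack(">d", value)


def send_metric(payload: bytes, address: Tuple[str, int]) -> None:
    """Send one datagram; UDP is fire-and-forget, so there is no reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)
```

A call would look something like 
send_metric(pack_metric("node1", "cpu_user", 1.5), ("127.0.0.1", 8649)), 
where 8649 is gmond's default port and the host name is made up.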

If anyone else is interested in trying this out while we're shaking out 
the code, I'd be happy to share some pre-release code with a few people.

-mark



------------------------------------------------------------------------------
_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
