I'm the developer of a performance monitoring tool called collectl (see http://collectl.sourceforge.net/), a fairly lightweight data collector capable of collecting most relevant performance metrics. In addition to the basics that most tools collect, like cpu, memory, disk and network, it also collects such things as nfs, infiniband, lustre, buddyinfo and interrupts by cpu, just to name a few. What I believe makes collectl different is that it logs data locally to disk, which lets it take samples in the 5-10 second range at under 0.2% cpu overhead, and it can use even shorter intervals, including fractional ones, if you need to. The point is that this is simply far too much data to ever try to monitor/display with ganglia, but I also believe these are the collection frequencies you need if you want to see what your system is really doing in diagnostic situations. At the same time, if you want a good overview of what your cluster is doing as a whole, you need ganglia.
However, it doesn't make a lot of sense to have 2 tools collecting the same data, one doing it locally and the other doing it for the cluster. Working with Evan and Ken, I've added a new capability to collectl to send its data (or even a subset if you prefer) to gmond in binary format, and it seems to work just fine. The magic here is that you can tell collectl to take samples every 10 seconds (its default) and write them to a local file, but only send samples to gmond every minute (or whatever you choose). That way you get your coarser data for ganglia to display (and a lot more than gmond normally collects) and still have fine-grained data for diagnostic purposes if/when you need it.

This leads to my question: what is the best way to send data to ganglia? I want to keep my messages very dense, so we chose to simply send out binary data in the same format gmond expects. In the case of PNNL, where they have a monitoring hierarchy, we've completely replaced all the monitoring gmonds with a dozen that act only as aggregators. About 190 nodes running collectl send UDP messages to each aggregator gmond, and it seems to run just fine.

Does this make sense? Is there anything to watch out for? If anyone else is interested in trying this out while we're shaking out the code, I'd be happy to share some pre-release code with a few people.

-mark
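
P.S. For anyone who wants the two-interval idea in concrete terms, here is a minimal Python sketch. It is not collectl's actual code. The names collect and make_udp_socket, all of the parameters, and the plain-text payload are made up for illustration; in particular, the payload is only a placeholder and not the binary format gmond expects.

```python
import socket
import time


def make_udp_socket():
    """Return a plain UDP socket suitable for fire-and-forget datagrams."""
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def collect(read_sample, log, sock, dest, interval=10.0, send_every=6,
            max_samples=None, sleep=time.sleep):
    """Log every sample locally, but forward only every send_every-th one.

    read_sample: hypothetical callable returning one sample as a text line.
    log:         any object with a write() method (e.g. an open file).
    sock, dest:  UDP socket and (host, port) of the aggregator.
    With interval=10 and send_every=6, the local log gets a sample every
    10 seconds while the aggregator sees one per minute.
    """
    taken = sent = 0
    while max_samples is None or taken < max_samples:
        line = read_sample()
        log.write(line + "\n")            # fine-grained data stays local
        taken += 1
        if taken % send_every == 0:       # coarse data goes over the wire
            # Placeholder payload: real gmond traffic is binary, not text.
            sock.sendto(line.encode("ascii"), dest)
            sent += 1
        sleep(interval)
    return taken, sent
```

To try it, pass a function that returns a line of text as read_sample, an open file as log, the result of make_udp_socket() as sock, and the (host, port) of whatever is listening as dest. Leaving max_samples as None makes it run until interrupted.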