Douglas, what Matthias said is good.
At one stage we had a grid of 6,000 servers in maybe 50+ clusters with 5-10 second polling (!!!).

Here are my experiences and some tips, some of which you will already know:

- The overhead of the gmond agent is very low on the monitored hosts, both for CPU and network I/O. Not storing any local data is a Good Thing.
- Network overhead from UDP data is really, really low. In our case we unicast the UDP to headnodes. Headnode CPU load was still really small.
- Your first (and also biggest) bottleneck is calling RRDupdate and writing RRD data to the filesystem. Many posts talk of this. We used SAN for the RRD files. Others made a tmpfs with a periodic rsync backup. strace gmetad and you will see what I mean.
- gmetad spawns 1 thread per data_source as far as I can see, and each thread does the TCP/XML data retrieval and then the RRD updates. This affected us because of the 5-10 second polling of data sources.
- Personally I like 10 second polling, but it depends on your typical job durations.

Tips?

- Make a grid, chopping your cluster up. Helps on the display side too!
- Integer values returned from gmond still give rise to RRD files that are updated at the poll rate, even if they are constant (e.g. cpu clock speed). Remove the ones you don't need or morph them into string values.
- Gaps in graphed data? For us it was the inability of each thread to do all it had to within the polling interval window. The ganglia server itself did not run out of overall CPU; in fact its CPU usage was quite low.
- We also got the occasional gap exactly on the hour. Matt Toy postulated that this was the moment that RRD had to update its aggregated values.
- Make gmetric scripts on the ganglia server that give you I/O wait, disk service time etc. Spikes in I/O wait correlated with gaps for us. umm. Mostly.

regards,
Richard G
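P.S. For the last tip, here is a minimal sketch of what an I/O wait gmetric script could look like. It is not the script we ran, just an illustration under some assumptions: a Linux ganglia server with /proc/stat, gmetric on the PATH, and a hypothetical metric name "iowait_pct". It samples the aggregate cpu counters twice and reports the share of ticks spent in iowait over that window.

```python
import subprocess
import time


def read_cpu_ticks(stat_text):
    """Return the aggregate 'cpu' counters (user..steal) from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Fields after the label: user nice system idle iowait irq softirq steal
            return [int(x) for x in fields[1:9]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Percentage of CPU ticks spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def gmetric_command(value, name="iowait_pct"):
    """Build the gmetric call; the metric name is a made-up example."""
    return ["gmetric", "--name", name, "--value", f"{value:.2f}",
            "--type", "float", "--units", "%"]


def report_iowait(interval=10):
    """Sample /proc/stat twice, 'interval' seconds apart, and publish the result."""
    with open("/proc/stat") as f:
        before = read_cpu_ticks(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = read_cpu_ticks(f.read())
    subprocess.run(gmetric_command(iowait_percent(before, after)), check=True)
```

Calling report_iowait() in a loop, or from cron, would give you a graph you can line up against the gaps. Disk service time would need a similar reader for /proc/diskstats, which I have left out here.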
_______________________________________________
Ganglia-general mailing list
https://lists.sourceforge.net/lists/listinfo/ganglia-general

