We use ganglia to monitor > 500 hosts in multiple datacenters with about 90k unique host:metric pairs per DC. We use this data for all of the cool graphs in the web UI and for passive alerting.
One of our checks measures the TN of load_one on every box. We want to make sure gmond is working and correctly updating metrics; otherwise we could be blind and not know it. We consider it a failure if TN is > 600. This is an arbitrary number, but 10 minutes seemed plenty long. Unfortunately, we are seeing this check fail far too often.

We set up two parallel gmetad instances (monitoring identical gmonds) per DC and have broken our problem into two classes:

 * (A) Only one of the gmetads stops updating for an entire cluster, and must be restarted to recover. Since the two gmetads disagree, we know the problem is in gmetad. [1]
 * (B) Both gmetads say an individual host has not reported (gmond aggregation or sending must be at fault). This issue is usually transient (that is, it recovers after some period of time greater than 10 minutes).

While attempting to reproduce (A) we ran several additional gmetad instances (again polling the same gmonds) around 2012-12-07. Failures per day are below [2]. The act of testing seems to have significantly increased the number of failures. This led us to consider whether polling a gmond aggregator could impair its ability to concurrently collect metrics.

We looked at the code but are not experienced with concurrent programming in C. Could someone with more familiarity with the gmond code comment on whether this is likely to be a worthwhile avenue of investigation? We are also looking for suggestions for an empirical test to rule this out.

(Of course, other comments on the root "TN goes up, metrics stop updating" sporadic problem are also welcome!)

Thank you,
Chris Burroughs

[1] https://github.com/ganglia/monitor-core/issues/47

[2] Failures per day (date as YYMMDD, then failure count):

120827   89
120828    6
120829    3
120830    4
120831    5
120901    1
120902    6
120903    2
120904    9
120905    4
120906   70
120907  523
120908   85
120909    4
120910    6
120911    2
120912    5
120913    5
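For anyone who wants to reproduce the check, here is a minimal sketch of the idea in Python. It is not our production code. It assumes the XML dump has HOST elements containing METRIC elements with NAME and TN attributes, and that gmetad serves that dump on its default xml_port (8651). The function names and the fetch helper are hypothetical and only for illustration.

```python
import socket
import xml.etree.ElementTree as ET

TN_LIMIT = 600  # seconds; the (arbitrary) failure threshold described above


def fetch_xml(host, port=8651, timeout=30):
    """Read the full XML dump from a gmetad (or gmond) TCP port.

    Assumption: the daemon writes the XML and then closes the connection,
    so we simply read until EOF. 8651 is gmetad's default xml_port.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_hosts(xml_text, metric="load_one", limit=TN_LIMIT):
    """Return {host_name: tn} for hosts whose `metric` has TN > limit."""
    stale = {}
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so this works with or without the
    # GRID/CLUSTER wrapper elements.
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                tn = int(m.get("TN", "0"))
                if tn > limit:
                    stale[host.get("NAME")] = tn
    return stale


def classify(stale_a, stale_b):
    """Split stale hosts into the two classes described above.

    (A): only one of the two gmetads sees the host as stale.
    (B): both gmetads see the host as stale.
    """
    names_a, names_b = set(stale_a), set(stale_b)
    return {
        "A": sorted(names_a ^ names_b),
        "B": sorted(names_a & names_b),
    }
```

Calling stale_hosts(fetch_xml(...)) against each of the two gmetads and passing both results to classify() gives the (A)/(B) split. Note that this sketch does not flag a host that is missing from the XML entirely, only one whose load_one TN has grown past the limit.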
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general