We use ganglia to monitor > 500 hosts in multiple datacenters with about
90k unique host:metric pairs per DC.  We use this data for all of the
cool graphs in the web UI and for passive alerting.

One of our checks measures the TN of load_one on every box (we want to
be sure gmond is working and updating metrics correctly; otherwise we
could be blind and not know it).  We consider it a failure if TN >
600.  The threshold is arbitrary, but 10 minutes seemed plenty long.
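
For concreteness, here is a minimal sketch of the kind of check we
mean (the gmetad hostname is a placeholder, 8651 is gmetad's default
xml_port, and our production check is wired into our alerting system,
but the logic is the same):

#!/usr/bin/env python3
import socket
import xml.etree.ElementTree as ET

TN_LIMIT = 600  # seconds; our (arbitrary) staleness threshold

def fetch_xml(host, port=8651):
    # gmetad dumps its whole XML tree and then closes the connection.
    with socket.create_connection((host, port), timeout=30) as sock:
        return sock.makefile('rb').read()

def stale_hosts(xml_bytes):
    # Yield (cluster, host, tn) for every host whose load_one is stale.
    for cluster in ET.fromstring(xml_bytes).iter('CLUSTER'):
        for host in cluster.iter('HOST'):
            for metric in host.iter('METRIC'):
                if metric.get('NAME') == 'load_one':
                    tn = int(metric.get('TN', '0'))
                    if tn > TN_LIMIT:
                        yield (cluster.get('NAME'), host.get('NAME'), tn)

if __name__ == '__main__':
    for cluster, host, tn in stale_hosts(fetch_xml('gmetad.example.com')):
        print('FAIL %s/%s load_one TN=%d' % (cluster, host, tn))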

Unfortunately we are seeing this check fail far too often.  We set up
two parallel gmetad instances (monitoring identical gmonds) per DC and
have broken our problem into two classes:
 * (A) Only one of the gmetads stops updating for an entire cluster,
and must be restarted to recover.  Since the gmetads disagree, we know
the problem is in gmetad. [1]
 * (B) Both gmetads say an individual host has not reported (gmond
aggregation or sending must be at fault).  This issue is usually
transient (that is, it recovers after some period of time greater than
10 minutes).  A rough sketch of how we tell the two classes apart
follows below.
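
To make the classification concrete, this is roughly how we decide
which class a failure falls into (the gmetad hostnames below are
placeholders for our two parallel instances, both on the default
xml_port 8651; fetch_xml() is the same helper as in the check sketch
above):

import socket
import xml.etree.ElementTree as ET

TN_LIMIT = 600  # seconds

def fetch_xml(host, port=8651):
    # gmetad dumps its whole XML tree and then closes the connection.
    with socket.create_connection((host, port), timeout=30) as sock:
        return sock.makefile('rb').read()

def load_one_tn(xml_bytes, wanted_host):
    # TN of load_one for one host, or None if the host is missing.
    for host in ET.fromstring(xml_bytes).iter('HOST'):
        if host.get('NAME') == wanted_host:
            for metric in host.iter('METRIC'):
                if metric.get('NAME') == 'load_one':
                    return int(metric.get('TN', '0'))
    return None

def classify(host):
    tn_a = load_one_tn(fetch_xml('gmetad-a.example.com'), host)
    tn_b = load_one_tn(fetch_xml('gmetad-b.example.com'), host)
    stale_a = tn_a is None or tn_a > TN_LIMIT
    stale_b = tn_b is None or tn_b > TN_LIMIT
    if stale_a != stale_b:
        return 'A: gmetads disagree, suspect gmetad'
    if stale_a and stale_b:
        return 'B: both gmetads agree host is stale, suspect gmond'
    return 'ok'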

While attempting to reproduce (A) we ran several additional gmetad
instances (again polling the same gmonds) around 2012-09-07.  Failures
per day are below [2].  The act of testing appears to have
significantly increased the number of failures.

This led us to consider whether the act of polling a gmond aggregator
could impair its ability to concurrently collect metrics.  We looked
at the code but are not experienced with concurrent programming in C.
Could someone more familiar with the gmond code comment on whether
this is likely to be a worthwhile avenue of investigation?  We are
also looking for suggestions for an empirical test to rule this out.
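
One straw-man test we are considering (the hostname and interval below
are placeholders): aggressively poll the XML port of the aggregator
for one cluster while leaving an otherwise similar cluster unpolled,
and then compare the TN-failure rates of the two over the same window.
Roughly:

#!/usr/bin/env python3
# Straw-man "poll hammer": repeatedly pull the full XML dump from one
# gmond aggregator so its TN-failure rate can be compared against an
# unpolled control cluster over the same window.
# gmond-agg.example.com is a placeholder; 8649 is gmond's default tcp port.
import socket
import time

TARGET = ('gmond-agg.example.com', 8649)
INTERVAL = 1.0  # seconds between polls; tune to vary the load

def poll_once(addr):
    # Read the complete XML dump; gmond closes the socket when done.
    with socket.create_connection(addr, timeout=30) as sock:
        return len(sock.makefile('rb').read())

if __name__ == '__main__':
    while True:
        start = time.time()
        try:
            size = poll_once(TARGET)
            print('%d polled %d bytes in %.2fs'
                  % (start, size, time.time() - start))
        except OSError as exc:
            print('%d poll failed: %s' % (start, exc))
        time.sleep(INTERVAL)

If polling really does interfere with collection, the hammered cluster
should show noticeably more TN failures than the control.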

(Of course, other comments on the root cause of the sporadic "TN goes
up, metrics stop updating" problem are also welcome!)

Thank you,
Chris Burroughs


[1] https://github.com/ganglia/monitor-core/issues/47

[2] Check failures per day (YYMMDD  count):
120827  89
120828  6
120829  3
120830  4
120831  5
120901  1
120902  6
120903  2
120904  9
120905  4
120906  70
120907  523
120908  85
120909  4
120910  6
120911  2
120912  5
120913  5
