[Ganglia-general] [gmetad] spotty updates - solution :)

Steven Wagner Wed, 03 Jul 2002 11:17:48 -0700

Well, I have no idea if this is an "official" solution but it sure as heck worked for me. I thought I'd share.


Here's the problem I was having, in a nutshell:

* Boxes in my Solaris cluster appeared to disappear and reappear between
  page views of gmetad-frontend. i.e., metacluster view says 18 hosts are
  down. Click on fileserver cluster, 2 hosts are down. Hit refresh, 0 hosts
  are down. A minute later, hit refresh, 16 hosts are down ...

* My Solaris boxes were not posting updates frequently enough ... or,
  more to the point, they were shotgunning updates almost all at the same
  time. So instead of updates trickling in every rand(40 seconds, 60
  seconds), they came in all at the same time.

* gmetad determines a host is down based on the REPORTED field in the
  host DTD. It compares this value to the current time, and if the
  difference in time is greater than or equal to the hardcoded threshold
  value, it marks the host as DOWN and ignores its metrics.

* Cranking the timeout value to over 60 seconds (i.e. 75 seconds :) )
  mostly fixed the problem ... but introduced a new one! The metacluster
  info expected more frequent updates than this. So even though the
  individual cluster graphs are "fixed," it breaks the metacluster info.

* Looked through gmond/metric.h ... aha! The most-often-reported metric
  was load_one, which was being polled every 15-20 seconds ... but reported
  every 50-70 seconds! I reduced the reporting thresholds to 35-50 seconds
  and distributed the new gmond.

* So far the glitching has been reduced and/or possibly eliminated.
  Looking pretty good so far (60 minutes and counting)!
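
For anyone who wants the logic spelled out, here is a rough sketch in Python of the down check described above and of why the reporting window matters. This is only an illustration, not gmetad's actual code: the function names are made up, and the 60-second threshold is an assumption, since the real hardcoded value isn't quoted in this message.

```python
# Illustrative only: names and the threshold value are hypothetical,
# not taken from the gmetad or gmond sources.

ASSUMED_THRESHOLD = 60  # seconds; assumed stand-in for the hardcoded value


def is_host_down(reported, now, threshold=ASSUMED_THRESHOLD):
    """Apply the rule from the list above: a host counts as DOWN when
    the time since its REPORTED timestamp is >= the threshold."""
    return (now - reported) >= threshold


def can_flap(report_max, threshold=ASSUMED_THRESHOLD):
    """A host can be marked DOWN between two reports if the longest
    possible gap between reports reaches the threshold."""
    return report_max >= threshold


if __name__ == "__main__":
    # Original load_one reporting window: every 50-70 seconds.
    print("50-70s window can flap:", can_flap(70))
    # Reduced reporting window: every 35-50 seconds.
    print("35-50s window can flap:", can_flap(50))
```

Under that assumed threshold, a 50-70 second reporting window can leave a gap long enough for a healthy host to be marked DOWN, while a 35-50 second window cannot. Network delay and clock skew between hosts are ignored here, so treat it as a back-of-the-envelope check only.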
