Well, I have no idea if this is an "official" solution but it sure as heck
worked for me. I thought I'd share.
Here's the problem I was having, in a nutshell:
* Boxes in my Solaris cluster appeared to disappear and reappear between
page views of gmetad-frontend. For example, the metacluster view says 18
hosts are down. Click on the fileserver cluster, 2 hosts are down. Hit
refresh, 0 hosts are down. A minute later, hit refresh, 16 hosts are
down ...
* My Solaris boxes were not posting updates frequently enough ... or,
more to the point, they were shotgunning updates almost all at once. So
instead of updates trickling in every rand(40 seconds, 60 seconds), they
arrived in bursts.
* gmetad determines a host is down based on the REPORTED field in the
host DTD. It compares this value to the current time, and if the
difference in time is greater than or equal to the hardcoded threshold
value, it marks the host as DOWN and ignores its metrics.
* Cranking the timeout value to over 60 seconds (i.e. 75 seconds :) )
mostly fixed the problem ... but introduced a new one! The metacluster
info expected more frequent updates than this. So even though the
individual cluster graphs are "fixed," it breaks the metacluster info.
* Looked through gmond/metric.h ... aha! The most-often-reported metric
was load_one, which was being polled every 15-20 seconds ... but reported
every 50-70 seconds! I reduced the reporting thresholds to 35-50 seconds
and distributed the new gmond.
* So far the glitching has been reduced and possibly eliminated. Looking
pretty good (60 minutes and counting)!
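
For anyone who wants to see why the reporting window matters, here is a
rough Python sketch of the down check described above. This is not
gmetad's actual code. The function names are made up for illustration, and
the 60 second timeout is my assumption (the real value is hardcoded and
not quoted here). It also pretends that load_one is the only metric that
refreshes REPORTED, which is a simplification.

```python
import random


def is_down(reported, now, timeout):
    """Down check as described above: the host is considered down when
    the age of its last report is greater than or equal to the timeout."""
    return now - reported >= timeout


def down_fraction(tmin, tmax, timeout, reports=10000, seed=0):
    """Estimate the fraction of time a host would look down if it reports
    once every rand(tmin, tmax) seconds (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    down_time = 0.0
    total_time = 0.0
    for _ in range(reports):
        interval = rng.uniform(tmin, tmax)
        # Within one gap between reports, the host looks down from the
        # moment the report age reaches the timeout until the next report.
        down_time += max(0.0, interval - timeout)
        total_time += interval
    return down_time / total_time


if __name__ == "__main__":
    ASSUMED_TIMEOUT = 60  # assumption, not the value from the gmetad source
    for window in ((50, 70), (35, 50)):
        frac = down_fraction(window[0], window[1], ASSUMED_TIMEOUT)
        print(f"report every {window[0]}-{window[1]}s: "
              f"looks down {frac:.1%} of the time")
```

With a 50-70 second reporting window, some gaps between reports are longer
than the assumed timeout, so a page view that lands in one of those gaps
sees the host as down. With 35-50 seconds, no gap can reach the timeout,
which would fit the glitching going away after the change.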