Well, I have no idea if this is an "official" solution but it sure as heck worked for me. I thought I'd share.

Here's the problem I was having, in a nutshell:

* Boxes in my Solaris cluster appeared to disappear and reappear between page views of gmetad-frontend. i.e., metacluster view says 18 hosts are down. Click on fileserver cluster, 2 hosts are down. Hit refresh, 0 hosts are down. A minute later, hit refresh, 16 hosts are down ...
* My Solaris boxes were not posting updates frequently enough ... or, more to the point, they were shotgunning updates almost all at the same time. So instead of updates trickling in every rand(40 seconds, 60 seconds), they came in all at the same time.
* gmetad determines a host is down based on the REPORTED field in the host DTD. It compares this value to the current time, and if the difference in time is greater than or equal to the hardcoded threshold value, it marks the host as DOWN and ignores its metrics.
* Cranking the timeout value to over 60 seconds (i.e. 75 seconds :) ) mostly fixed the problem ... but introduced a new one! The metacluster info expected more frequent updates than this. So even though the individual cluster graphs are "fixed," it breaks the metacluster info.
* Looked through gmond/metric.h ... aha! The most-often-reported metric was load_one, which was being polled every 15-20 seconds ... but reported every 50-70 seconds! I reduced the reporting thresholds to 35-50 seconds and distributed the new gmond.
* So far the glitching has been reduced and/or possibly eliminated. Looking pretty good so far (60 minutes and counting)!
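For anyone who wants to see why the reporting window matters, here is a rough Python sketch of the down check as I described it above. This is not gmetad's actual code. The 60-second threshold is my assumption (the real value is hardcoded in the source), the function names are made up for illustration, and it pretends load_one is the only metric that refreshes REPORTED.

```python
import random

# Assumed value in seconds; the real threshold is hardcoded in gmetad.
DOWN_THRESHOLD = 60


def is_down(reported, now, threshold=DOWN_THRESHOLD):
    """Host counts as down when its last report is at least `threshold` seconds old."""
    return (now - reported) >= threshold


def fraction_of_gaps_marked_down(tmin, tmax, threshold=DOWN_THRESHOLD,
                                 trials=10_000, seed=0):
    """Fraction of report gaps, drawn uniformly from [tmin, tmax] seconds,
    that are long enough for the host to be flagged down before the next
    report arrives."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        gap = rng.uniform(tmin, tmax)
        # Check the host just before its next report: REPORTED is `gap` seconds old.
        if is_down(reported=0.0, now=gap, threshold=threshold):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for window in [(50, 70), (35, 50)]:
        share = fraction_of_gaps_marked_down(*window)
        print(f"report every {window[0]}-{window[1]}s: {share:.0%} of gaps reach the threshold")
```

With a 50-70 second window, a good share of the gaps reach the assumed 60-second threshold, so a host can flip to DOWN right before its next report lands. With a 35-50 second window no gap can reach it, which fits what I saw after pushing out the new gmond.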


