Hi,

It seems I'm dealing with an issue similar to this one:

[1] https://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg02235.html

Config: it's a setup running Ganglia v3.6.0 with ~100 nodes distributed across ~20 clusters. Each cluster has ~2 nodes, and there's one larger cluster with ~30 nodes, which is the one that is dropping/discarding metrics.

This setup had been working without issues for the last 2 years, until we restarted the server on which "gmetad" runs. From that moment, the "big cluster" began to fail randomly, and almost all the metrics from this cluster are discarded. The problem seems related to the clock settings and the TMAX value.

After reading several troubleshooting guides, this is what I have found:

- The LOCALTIME value for the cluster with problems, say "cluster1", seems delayed by ~350 seconds compared with the other clusters. That's the most important hint I have found:

  <CLUSTER NAME="cluster1" LOCALTIME="1442855337" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">
  <CLUSTER NAME="cluster2" LOCALTIME="1442855687" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">
  <CLUSTER NAME="cluster3" LOCALTIME="1442855696" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">

  All the nodes in "cluster1" have their clocks configured correctly. I have double-checked it; in fact they're synchronized with NTP.

- Doing some debugging, I found this:

  <HOST NAME="nodeXXX" IP="x.x.x.x" REPORTED="1442845480" TN="219" TMAX="20" DMAX="86400" LOCATION="unspecified" GMOND_STARTED="1442842343" TAGS="">

  TN is bigger than TMAX * 4 (219 > 80), so ganglia-webfrontend reports all the nodes from this cluster as "down"; see [1] to understand this behaviour. I applied the workaround pointed out in [1] (patching the host_alive() function in ganglia.php), and it seems to work: at least the nodes now appear in the cluster view. However, they have no data, so it seems that "gmetad" is discarding it.

I have tried several approaches (increasing the "heartbeat" time_threshold, adding the host_tmax setting, etc.), but none of them work. It seems that "gmetad" is discarding the metrics and they are never written to the .rrd files.

Any clue as to what may be wrong?

Thanks!

Regards,

Santi Saez
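
P.S. In case it helps to reproduce the two symptoms, here is a minimal Python sketch of the checks described above: the LOCALTIME lag between clusters and the "TN bigger than TMAX * 4" rule. It is only an illustration, not the actual ganglia-webfrontend or gmetad code. The <SAMPLE> wrapper element, the function names, and the placement of nodeXXX inside cluster1 are assumptions; the attribute values are the ones quoted in this message.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the values quoted above. The <SAMPLE> wrapper and the
# placement of nodeXXX inside cluster1 are assumptions for illustration.
SAMPLE_XML = """
<SAMPLE>
  <CLUSTER NAME="cluster1" LOCALTIME="1442855337">
    <HOST NAME="nodeXXX" REPORTED="1442845480" TN="219" TMAX="20"/>
  </CLUSTER>
  <CLUSTER NAME="cluster2" LOCALTIME="1442855687"/>
  <CLUSTER NAME="cluster3" LOCALTIME="1442855696"/>
</SAMPLE>
"""


def host_looks_down(tn, tmax, factor=4):
    """Rule described above: a host is shown as down when TN > TMAX * 4."""
    return tn > tmax * factor


def localtime_lag(root):
    """Seconds each cluster's LOCALTIME trails the newest LOCALTIME seen."""
    times = {c.get("NAME"): int(c.get("LOCALTIME")) for c in root.iter("CLUSTER")}
    newest = max(times.values())
    return {name: newest - t for name, t in times.items()}


def down_hosts(root):
    """Names of hosts that the TN/TMAX rule would flag as down."""
    return [
        h.get("NAME")
        for h in root.iter("HOST")
        if host_looks_down(int(h.get("TN")), int(h.get("TMAX")))
    ]


root = ET.fromstring(SAMPLE_XML.strip())
print(localtime_lag(root))
print(down_hosts(root))
```

With the values above, cluster1 trails the newest cluster by 359 seconds (cluster2 by 9), and nodeXXX is flagged as down because 219 > 20 * 4.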
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general