Hi Vladimir,

I have checked the clock on all nodes, not only those in cluster1, and it seems to be accurate.
How can I determine which gmond is the metrics source for cluster1? Is it the first entry in data_source "cluster1"? I have been debugging with "tcpdump" and it seems to be the first IP, but I want to be 100% sure. Thanks!

P.S.: I forgot to say that we're using unicast, since it's an AWS-based platform.

Regards,

Santi Saez

2015-09-22 2:50 GMT+02:00 Vladimir Vuksan <vli...@veus.hr>:

> Is the clock accurate on the gmond that is collecting metrics for cluster1?
>
> Vladimir
>
> On 09/21/2015 at 01:38 PM, Santi Saez wrote:
>
> Hi,
>
> It seems I'm dealing with an issue similar to this one:
>
> [1] https://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg02235.html
>
> Config: it's a setup running Ganglia v3.6.0 with ~100 nodes distributed
> in ~20 clusters. Each cluster has ~2 nodes, and there's one larger cluster
> with ~30 nodes, the one that is dropping/discarding metrics.
>
> This setup has been working without issues for the last 2 years, until we
> restarted the server on which "gmetad" runs. From that moment, the "big
> cluster" began to fail randomly and almost all the metrics from this
> cluster are discarded. The issue seems related to the clock settings and
> the TMAX value.
>
> After reading several troubleshooting guides, this is what I have found:
>
> - The LOCALTIME value for the cluster with problems, say "cluster1", seems
>   delayed by ~300 seconds. That's the most important hint I have found:
>
> <CLUSTER NAME="cluster1" LOCALTIME="1442855337" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">
> <CLUSTER NAME="cluster2" LOCALTIME="1442855687" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">
> <CLUSTER NAME="cluster3" LOCALTIME="1442855696" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">
>
>   All the nodes in "cluster1" have their clocks configured correctly. I have
>   double-checked it; in fact they're synchronized with NTP.
>
> - Doing some debugging, I found this:
>
> <HOST NAME="nodeXXX" IP="x.x.x.x" REPORTED="1442845480" TN="219" TMAX="20" DMAX="86400" LOCATION="unspecified" GMOND_STARTED="1442842343" TAGS="">
>
>   It seems that TN is bigger than TMAX * 4, so ganglia-webfrontend reports
>   all the nodes from this cluster as "down" (see [1] to understand this
>   behaviour). I applied the workaround pointed out in [1], patching the
>   host_alive() function in ganglia.php, and it seems to work: at least the
>   nodes appear in the cluster view. However, they have no data, so it seems
>   that "gmetad" is discarding it.
>
> I have tried several approaches (increasing the "heartbeat" time_threshold,
> adding the host_tmax setting, etc.) but none of them works. It seems that
> "gmetad" is discarding metrics and they are never written to the .rrd files.
>
> Any clue as to what may be wrong? Thanks!
>
> Regards,
>
> Santi Saez
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
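
For anyone who wants to reproduce the two checks described in the quoted message, here is a minimal Python sketch. It only re-does the arithmetic on the values pasted above; it is not taken from the Ganglia source. The function names localtime_lag and looks_down are hypothetical, the <GRID> wrapper is added only so the trimmed fragment parses on its own, and the factor of 4 is the one mentioned in the thread.

```python
import xml.etree.ElementTree as ET

# CLUSTER entries copied from the message above, trimmed to the attributes
# used here and self-closed so the fragment is well-formed XML.
# The <GRID> wrapper is an assumption added only to give the parser a root.
SAMPLE = """
<GRID>
  <CLUSTER NAME="cluster1" LOCALTIME="1442855337"/>
  <CLUSTER NAME="cluster2" LOCALTIME="1442855687"/>
  <CLUSTER NAME="cluster3" LOCALTIME="1442855696"/>
</GRID>
"""


def localtime_lag(xml_text):
    """Return {cluster name: seconds behind the newest LOCALTIME in the dump}."""
    root = ET.fromstring(xml_text)
    times = {c.get("NAME"): int(c.get("LOCALTIME")) for c in root.iter("CLUSTER")}
    newest = max(times.values())
    return {name: newest - t for name, t in times.items()}


def looks_down(tn, tmax, factor=4):
    """Rule as described in the thread: a host is shown as down when TN > TMAX * 4."""
    return tn > tmax * factor


lags = localtime_lag(SAMPLE)
print(lags)                 # lag of each cluster relative to the newest LOCALTIME
print(looks_down(219, 20))  # TN and TMAX from the nodeXXX entry above
```

With the pasted values, cluster1 is 359 seconds behind cluster3, and TN=219 exceeds TMAX * 4 = 80, which matches the "down" status seen in the web frontend.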