Hi Vladimir,

I have checked the clock on all nodes, not only those in cluster1, and it
seems to be accurate.

How can I determine which gmond is the metrics source for cluster1? Is it
the first entry in data_source "cluster1"? I have been debugging with
"tcpdump" and it seems to be the first IP, but I want to be 100% sure. Thanks!

P.S.: I forgot to mention that we're using unicast, since it's an AWS-based
platform.
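
For anyone who wants to repeat the check, here is a minimal sketch (not part
of our actual tooling) of how the clock and heartbeat comparison could be
scripted in Python. It assumes the gmond being queried exposes its XML dump
on a tcp_accept_channel (port 8649 by default). The function names are made
up for illustration, and the factor of 4 comes from the TN > TMAX * 4 rule
discussed in the quoted message below.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8649, timeout=5.0):
    """Read the XML dump a gmond sends on its TCP accept channel.

    Assumption: the daemon writes the whole document and then closes
    the connection, so we simply read until EOF.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def cluster_skew(xml_data, now):
    """Return {cluster name: now - LOCALTIME} for every CLUSTER element."""
    root = ET.fromstring(xml_data)
    return {
        cluster.get("NAME"): now - int(cluster.get("LOCALTIME"))
        for cluster in root.iter("CLUSTER")
    }


def stale_hosts(xml_data, factor=4):
    """Return the names of HOST elements whose TN exceeds factor * TMAX."""
    root = ET.fromstring(xml_data)
    return [
        host.get("NAME")
        for host in root.iter("HOST")
        if int(host.get("TN")) > factor * int(host.get("TMAX"))
    ]
```

Running cluster_skew(fetch_xml(address), time.time()) against each address
listed in data_source "cluster1" (after importing time) would show whether
any of those gmonds reports a LOCALTIME that lags behind the machine doing
the query.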

Regards,

Santi Saez

2015-09-22 2:50 GMT+02:00 Vladimir Vuksan <vli...@veus.hr>:

> Is the clock accurate on the gmond that is collecting metrics for
> cluster1?
>
> Vladimir
>
>
> On 09/21/2015 at 01:38 PM, Santi Saez wrote:
>
> Hi,
>
> It seems I'm dealing with an issue similar to this one:
>
> [1]
> https://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg02235.html
>
> Config: the setup runs Ganglia v3.6.0 with ~100 nodes distributed across
> ~20 clusters. Each cluster has ~2 nodes, except for one larger cluster
> with ~30 nodes, which is the one dropping/discarding metrics.
>
> This setup had been working without issues for the last 2 years, until we
> restarted the server on which "gmetad" runs. Since then, the "big
> cluster" has been failing randomly and almost all of its metrics are
> discarded. The problem seems related to the clock settings and the TMAX
> value.
>
> After reading several troubleshooting guides this is what I have found:
>
> - The LOCALTIME value for the problematic cluster, say "cluster1", seems
> delayed by ~300 seconds; that's the most important hint I have found:
>
> <CLUSTER NAME="cluster1" LOCALTIME="1442855337" OWNER="unspecified"
> LATLONG="unspecified" URL="unspecified">
> <CLUSTER NAME="cluster2" LOCALTIME="1442855687" OWNER="unspecified"
> LATLONG="unspecified" URL="unspecified">
> <CLUSTER NAME="cluster3" LOCALTIME="1442855696" OWNER="unspecified"
> LATLONG="unspecified" URL="unspecified">
>
> All the nodes in "cluster1" have their clocks configured correctly; I have
> double-checked it, and in fact they're synchronized with NTP.
>
> - Doing some debugging, I found this:
>
> <HOST NAME="nodeXXX" IP="x.x.x.x" REPORTED="1442845480" TN="219" TMAX="20"
> DMAX="86400" LOCATION="unspecified" GMOND_STARTED="1442842343" TAGS="">
>
> It seems that TN is bigger than TMAX * 4, so ganglia-webfrontend reports
> all the nodes from this cluster as "down" (see [1] to understand this
> behaviour). I have applied the workaround pointed out in [1], patching the
> host_alive() function in ganglia.php, and it seems to work: the nodes now
> appear in the cluster view, but they have no data. It seems that "gmetad"
> is discarding the data.
>
> I have tried several approaches: increasing the "heartbeat"
> time_threshold, adding the host_tmax setting, etc., but none of them work.
> It seems that "gmetad" is discarding the metrics and they are never
> written to the .rrd files.
>
> Any clue as to what may be wrong? Thanks!
>
> Regards,
>
> Santi Saez
>
>
> ------------------------------------------------------------------------------
>
>
>
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>
>