Hi,

It seems I'm dealing with an issue similar to this one:

[1]
https://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg02235.html

Config: a setup running Ganglia v3.6.0 with ~100 nodes distributed across
~20 clusters. Each cluster has ~2 nodes, except for one larger cluster with
~30 nodes, which is the one dropping/discarding metrics.

This setup had been working without issues for the last 2 years, until we
restarted the server on which "gmetad" runs. From that moment, the "big
cluster" began to fail randomly, and almost all the metrics from this
cluster are now discarded. The problem seems to be related to the clock
settings and the TMAX value.

After reading several troubleshooting guides this is what I have found:

- The LOCALTIME value for the problem cluster, say "cluster1", seems
delayed by roughly 350 seconds compared with the other clusters. That's the
most important hint I have found:

<CLUSTER NAME="cluster1" LOCALTIME="1442855337" OWNER="unspecified"
LATLONG="unspecified" URL="unspecified">
<CLUSTER NAME="cluster2" LOCALTIME="1442855687" OWNER="unspecified"
LATLONG="unspecified" URL="unspecified">
<CLUSTER NAME="cluster3" LOCALTIME="1442855696" OWNER="unspecified"
LATLONG="unspecified" URL="unspecified">

All the nodes in "cluster1" have their clocks configured correctly. I have
double-checked it; in fact they're synchronized with NTP.

- Doing some debugging, I found this:

<HOST NAME="nodeXXX" IP="x.x.x.x" REPORTED="1442845480" TN="219" TMAX="20"
DMAX="86400" LOCATION="unspecified" GMOND_STARTED="1442842343" TAGS="">

TN is larger than TMAX * 4 (219 > 80), so ganglia-webfrontend reports all
the nodes from this cluster as "down"; see [1] for an explanation of this
behaviour. I applied the workaround suggested in [1], patching the
host_alive() function in ganglia.php, and it seems to work: the nodes now
appear in the cluster view, but they have no data, so it seems that
"gmetad" is discarding the data.
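
For anyone who wants to reproduce the two checks above against their own
XML dump, here is a minimal Python sketch. It is not Ganglia code: the
function names (localtime_lags, hosts_over_threshold) are made up for
illustration, the sample wraps the values quoted above in a simplified,
closed XML document, and placing nodeXXX inside "cluster1" is an
assumption made only so the sample parses on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: attribute values copied from the fragments above,
# trimmed and closed so the snippet is self-contained.
SAMPLE = """
<GANGLIA_XML>
<CLUSTER NAME="cluster1" LOCALTIME="1442855337">
<HOST NAME="nodeXXX" REPORTED="1442845480" TN="219" TMAX="20"/>
</CLUSTER>
<CLUSTER NAME="cluster2" LOCALTIME="1442855687"/>
<CLUSTER NAME="cluster3" LOCALTIME="1442855696"/>
</GANGLIA_XML>
"""


def localtime_lags(xml_text):
    """Seconds each cluster's LOCALTIME trails the newest one in the dump."""
    root = ET.fromstring(xml_text)
    times = {c.get("NAME"): int(c.get("LOCALTIME"))
             for c in root.iter("CLUSTER")}
    newest = max(times.values())
    return {name: newest - t for name, t in times.items()}


def hosts_over_threshold(xml_text, factor=4):
    """Names of hosts whose TN exceeds factor * TMAX (the "down" rule)."""
    root = ET.fromstring(xml_text)
    return [h.get("NAME") for h in root.iter("HOST")
            if int(h.get("TN")) > factor * int(h.get("TMAX"))]


if __name__ == "__main__":
    print(localtime_lags(SAMPLE))
    print(hosts_over_threshold(SAMPLE))
```

With the sample values, "cluster1" trails the newest LOCALTIME by 359
seconds and nodeXXX is flagged because 219 > 4 * 20.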

I have tried several approaches: increasing the "heartbeat" time_threshold,
adding the host_tmax setting, etc., but none of them work. It seems that
"gmetad" is still discarding metrics, and they are never written to the
.rrd files.

Any clue as to what may be wrong? Thanks!

Regards,

Santi Saez