Hi Ganglia community,

We are using gmetad to retrieve statistics from gmond daemons, grouping
servers by cluster:

[gmetad.conf]
+ data_source "1" y.y.y.y:8649 y.y.y.y:8649 ....
<----- snipping content ----->
+ data_source "15" y.y.y.y:8649 y.y.y.y:8649 ....

** The maximum number of hosts listed per data_source in our gmetad.conf
is 5.

We have around 15 clusters working fine, but one of them, the biggest one
(around 80 servers), is failing. By failing I mean that this cluster's
graphs show gaps very often. Comparing it with the other clusters, this is
what we see:

1- gmetad asks gmond (y.y.y.y:8649) for information
2- gmond replies with a huge XML (around 4MB) containing the stats for the
whole set of servers.
3- gmond closes the connection with FIN+PUSH+ACK
4- gmetad acknowledges (ACK) but doesn't close the connection, so the
data thread is stuck with a CLOSE_WAIT connection on the gmetad side
5- After two minutes (more or less), gmetad sends a FIN+ACK and the
connection is closed.
6- After a few seconds, gmetad polls again (the data thread is freed once
the CLOSE_WAIT connection is closed) and the cycle starts over.
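
The CLOSE_WAIT state in step 4 can also be checked without a packet
capture. The following is only a minimal sketch, assuming a Linux gmetad
host and the gmond port 8649 used above; the function name
close_wait_sockets is hypothetical and not part of Ganglia. It parses the
text of /proc/net/tcp and returns the sockets that are in CLOSE_WAIT
towards the given remote port.

```python
# Sketch: find TCP sockets stuck in CLOSE_WAIT towards the gmond port.
# Assumptions: Linux /proc/net/tcp layout (IPv4 only), gmond on port 8649.

CLOSE_WAIT = 0x08  # state code for CLOSE_WAIT in /proc/net/tcp


def close_wait_sockets(proc_net_tcp_text, remote_port=8649):
    """Return (local, remote) hex address pairs in CLOSE_WAIT to remote_port."""
    matches = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        local, remote, state = fields[1], fields[2], fields[3]
        port = int(remote.rsplit(":", 1)[1], 16)  # port is hex after the colon
        if int(state, 16) == CLOSE_WAIT and port == remote_port:
            matches.append((local, remote))
    return matches
```

On the gmetad host it could be fed with the contents of /proc/net/tcp,
for example close_wait_sockets(open("/proc/net/tcp").read()), and run
every few seconds to see how long each socket stays in that state.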

Doing the same analysis for the other clusters (smaller number of hosts),
we see that gmetad closes the connection properly and polls every 10-15
seconds.
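
To compare the big cluster with the small ones outside of gmetad, the XML
dump can be pulled directly from a gmond port and timed. This is a rough
sketch, not how gmetad itself is implemented; fetch_gmond_xml is a
hypothetical helper name, and the host, port and timeout values are
assumptions to be adjusted. It reads until gmond closes the connection
and then closes its own side immediately.

```python
import socket
import time


def fetch_gmond_xml(host, port=8649, timeout=30.0):
    """Read gmond's XML dump until EOF; return (payload_bytes, seconds)."""
    start = time.monotonic()
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # empty read: the peer has sent its FIN
                break
            chunks.append(data)
    # Leaving the with-block closes our side of the connection right away,
    # so this client does not linger in CLOSE_WAIT.
    return b"".join(chunks), time.monotonic() - start
```

Calling it against one gmond of the 80-server cluster and one gmond of a
small cluster gives the payload size and transfer time for each, which
helps tell a slow transfer apart from slow processing inside gmetad.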

We really don't know what might be causing this erratic behaviour. So far
we have tried, without success, restarting the gmetad service, rebooting
the gmetad server (host), and increasing the number of threads
(server_threads).

# /usr/sbin/gmetad -V
gmetad 3.6.0

# /usr/sbin/gmond -V
gmond 3.6.0

At first we thought it was an I/O issue, but increasing our disk IOPS did
not help. We are now fairly sure it is a software or configuration issue,
because the other clusters are not affected.

Any help here would be much appreciated.

Thanks in advance.

-- 

*Javier Villar *
Site Reliability Engineer

*CartoDB*
Plaza de Callao 4, 2, 28013, Madrid, España
www.cartodb.com