Re: [Ganglia-general] gmetad data thread is not closing connections properly (CLOSE_WAIT connections)

2016-03-23 Thread Chris Burroughs
Hi Javier, your issue sounds at least somewhat similar to:
https://github.com/ganglia/monitor-core/issues/47

That issue includes several cross-referenced discussions, but no clear fixes.
Are your clusters all on the same local network?

On 02/15/2016 12:38 PM, Javier Villar Fernández wrote:
> Hi Ganglia community,
>
> we are using gmetad to retrieve statistics from gmond daemons, grouping
> servers by cluster:
>
> [gmetad.conf]
> + data_source "1" y.y.y.y.y:8649 y.y.y.y:8649
> <- Snipping content ->
> + data_source "15" y.y.y.y.y:8649 y.y.y.y:8649
>
> ** The maximum number of hosts per data_source in our gmetad.conf is 5.
>
> We have around 15 clusters working fine, but one of them, the biggest one
> (around 80 servers), is failing. By failing I mean that this cluster's
> graphs show blanks very often. Comparing it with the other clusters, this
> is what we see:
>
> 1- gmetad asks gmond (y.y.y.y:8649) for information.
> 2- gmond replies with a huge XML document (around 4 MB) containing the
> stats for the whole set of servers.
> 3- gmond finishes the connection with FIN+PUSH+ACK.
> 4- gmetad acknowledges (ACK) but doesn't close the connection, so the data
> thread is stuck with a CLOSE_WAIT connection on the gmetad side.
> 5- After about two minutes, gmetad sends a FIN+ACK and the connection is
> closed.
> 6- After a few seconds, gmetad polls again (the data thread is freed once
> the CLOSE_WAIT connection is finished) and the cycle starts over.
>
> Doing the same analysis for the other clusters (smaller number of hosts),
> we see that gmetad closes the connection properly and polls every 10-15
> seconds.
>
> We really don't know what might be causing this erratic behaviour. So far
> we have tried, without success, restarting the gmetad service, rebooting
> the gmetad host, and increasing the number of threads (server_threads).
>
> # /usr/sbin/gmetad -V
> gmetad 3.6.0
>
> # /usr/sbin/gmond -V
> gmond 3.6.0
>
> At first we thought it was an I/O issue, but after increasing our disk
> IOPS we realized it is not. We are pretty sure it has to be a
> software/configuration issue, because the other clusters are not affected.
>
> Any help here would be much appreciated.
>
> Thanks in advance.
>
> ___
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>




[Ganglia-general] gmetad data thread is not closing connections properly (CLOSE_WAIT connections)

2016-02-15 Thread Javier Villar Fernández
Hi Ganglia community,

we are using gmetad to retrieve statistics from gmond daemons, grouping
servers by cluster:

[gmetad.conf]
+ data_source "1" y.y.y.y.y:8649 y.y.y.y:8649
<- Snipping content ->
+ data_source "15" y.y.y.y.y:8649 y.y.y.y:8649

** The maximum number of hosts per data_source in our gmetad.conf is 5.

We have around 15 clusters working fine, but one of them, the biggest one
(around 80 servers), is failing. By failing I mean that this cluster's
graphs show blanks very often. Comparing it with the other clusters, this
is what we see:

1- gmetad asks gmond (y.y.y.y:8649) for information.
2- gmond replies with a huge XML document (around 4 MB) containing the
stats for the whole set of servers.
3- gmond finishes the connection with FIN+PUSH+ACK.
4- gmetad acknowledges (ACK) but doesn't close the connection, so the data
thread is stuck with a CLOSE_WAIT connection on the gmetad side.
5- After about two minutes, gmetad sends a FIN+ACK and the connection is
closed.
6- After a few seconds, gmetad polls again (the data thread is freed once
the CLOSE_WAIT connection is finished) and the cycle starts over.
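
The cycle in steps 1-6 can be timed outside gmetad with a minimal sketch
like the one below. It assumes gmond writes its XML dump as soon as a TCP
client connects to port 8649 and then closes the connection. timed_fetch
and its parameters are hypothetical names for illustration, not part of
Ganglia.

```python
import socket
import time


def timed_fetch(host, port, timeout=30.0):
    """Connect, read until the peer closes, then close our side at once.

    Returns (bytes_received, seconds_elapsed). Hypothetical helper,
    not a Ganglia API.
    """
    start = time.monotonic()
    total = 0
    # Leaving the "with" block closes the socket immediately, so the local
    # end does not linger in CLOSE_WAIT after the peer's FIN.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # empty read: the peer has sent its FIN
                break
            total += len(chunk)
    return total, time.monotonic() - start
```

Calling timed_fetch("y.y.y.y", 8649) against the large cluster and against
a small one would show whether the transfer itself is slow or whether the
delay only appears inside gmetad after the data has arrived.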

Doing the same analysis for the other clusters (smaller number of hosts),
we see that gmetad closes the connection properly and polls every 10-15
seconds.
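
One way to compare clusters over time is to count, on the gmetad host, the
sockets sitting in CLOSE_WAIT whose remote port is 8649. The sketch below
assumes Linux and the text format of /proc/net/tcp, where state code 08
means CLOSE_WAIT and addresses are written in hexadecimal.
count_close_wait is a hypothetical helper, not a Ganglia tool.

```python
CLOSE_WAIT = "08"  # Linux kernel state code for CLOSE_WAIT in /proc/net/tcp


def count_close_wait(proc_net_tcp_text, remote_port=8649):
    """Count IPv4 sockets in CLOSE_WAIT whose remote port matches."""
    count = 0
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Skip the column header and any short or malformed line.
        if len(fields) < 4 or ":" not in fields[2]:
            continue
        remote, state = fields[2], fields[3]
        # Addresses look like "0200007F:21C9": hex IP, colon, hex port.
        port = int(remote.rsplit(":", 1)[1], 16)
        if state == CLOSE_WAIT and port == remote_port:
            count += 1
    return count
```

On the gmetad host this could be run as
count_close_wait(open("/proc/net/tcp").read()). IPv6 connections are listed
separately in /proc/net/tcp6 and are not covered by this sketch.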

We really don't know what might be causing this erratic behaviour. So far
we have tried, without success, restarting the gmetad service, rebooting
the gmetad host, and increasing the number of threads (server_threads).

# /usr/sbin/gmetad -V
gmetad 3.6.0

# /usr/sbin/gmond -V
gmond 3.6.0

At first we thought it was an I/O issue, but after increasing our disk
IOPS we realized it is not. We are pretty sure it has to be a
software/configuration issue, because the other clusters are not affected.

Any help here would be much appreciated.

Thanks in advance.

-- 

*Javier Villar *
Site Reliability Engineer

*CartoDB*
Plaza de Callao 4, 2, 28013, Madrid, España
www.cartodb.com