Re: [Ganglia-general] gmetad data thread is not closing connections properly (CLOSE_WAIT connections)
Hi Javier,

Your issue sounds at least somewhat similar to:
https://github.com/ganglia/monitor-core/issues/47
which includes several cross-referenced discussions, but no clear fixes.

Are your clusters all on the same local network?

On 02/15/2016 12:38 PM, Javier Villar Fernández wrote:
> Hi Ganglia community,
>
> We are using gmetad to retrieve statistics from gmond daemons, grouping
> servers by cluster:
>
> [gmetad.conf]
> + data_source "1" y.y.y.y:8649 y.y.y.y:8649
> <- Snipping content ->
> + data_source "15" y.y.y.y:8649 y.y.y.y:8649
>
> ** The maximum number of hosts per data_source that we have in our
> gmetad.conf is 5.
>
> We have around 15 clusters working fine, but one of them, the biggest
> (around 80 servers), is failing. By "failing" I mean that we very often
> (really often indeed) have blanks in this cluster's graphs. Compared
> with the other clusters, what we see is:
>
> 1. gmetad asks gmond (y.y.y.y:8649) for information.
> 2. gmond replies with a huge XML document (around 4 MB) containing the
>    stats for the whole set of servers.
> 3. gmond finishes the connection with FIN+PSH+ACK.
> 4. gmetad acknowledges (ACK) but does not close the connection, so the
>    data thread is stuck with a CLOSE_WAIT connection on the gmetad side.
> 5. After roughly two minutes, gmetad sends a FIN+ACK and the connection
>    is closed.
> 6. A few seconds later, gmetad polls again (the data thread is freed
>    once the CLOSE_WAIT connection is gone) and the cycle repeats.
>
> Doing the same analysis for the other clusters (with fewer hosts), we
> see gmetad close the connection properly and poll every 10-15 seconds.
>
> We really don't know what might be causing this erratic behaviour. So
> far we have tried, without luck, restarting the gmetad service,
> restarting the gmetad server (host), and increasing the number of
> threads (server_threads).
>
> # /usr/sbin/gmetad -V
> gmetad 3.6.0
>
> # /usr/sbin/gmond -V
> gmond 3.6.0
>
> At first we thought it was an I/O issue, but after increasing our disk
> IOPS we have realized it is not, and we are pretty sure it has to be a
> software/configuration issue, because the other clusters are not
> affected by the same problem.
>
> Any help here would be much appreciated.
>
> Thanks in advance.

___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
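While tracking this down, it helps to quantify how many pollers are actually stuck at any moment. A minimal sketch of one way to do that, assuming the column layout of Linux's `ss -tan` output (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port); the function name is hypothetical, not part of Ganglia:

```python
import subprocess

def count_close_wait(ss_output, port=8649):
    """Count CLOSE-WAIT sockets whose peer uses the given port,
    parsed from the text output of `ss -tan`."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # ss columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) >= 5 and fields[0] == "CLOSE-WAIT":
            peer_port = fields[4].rsplit(":", 1)[-1]
            if peer_port == str(port):
                count += 1
    return count

if __name__ == "__main__":
    out = subprocess.run(["ss", "-tan"], capture_output=True,
                         text=True).stdout
    print(count_close_wait(out), "CLOSE-WAIT sockets to port 8649")
```

Running this periodically on the gmetad host should show one stuck socket per affected data_source while the two-minute stall is in progress.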
[Ganglia-general] gmetad data thread is not closing connections properly (CLOSE_WAIT connections)
Hi Ganglia community,

We are using gmetad to retrieve statistics from gmond daemons, grouping
servers by cluster:

[gmetad.conf]
+ data_source "1" y.y.y.y:8649 y.y.y.y:8649
<- Snipping content ->
+ data_source "15" y.y.y.y:8649 y.y.y.y:8649

** The maximum number of hosts per data_source that we have in our
gmetad.conf is 5.

We have around 15 clusters working fine, but one of them, the biggest
(around 80 servers), is failing. By "failing" I mean that we very often
(really often indeed) have blanks in this cluster's graphs. Compared
with the other clusters, what we see is:

1. gmetad asks gmond (y.y.y.y:8649) for information.
2. gmond replies with a huge XML document (around 4 MB) containing the
   stats for the whole set of servers.
3. gmond finishes the connection with FIN+PSH+ACK.
4. gmetad acknowledges (ACK) but does not close the connection, so the
   data thread is stuck with a CLOSE_WAIT connection on the gmetad side.
5. After roughly two minutes, gmetad sends a FIN+ACK and the connection
   is closed.
6. A few seconds later, gmetad polls again (the data thread is freed
   once the CLOSE_WAIT connection is gone) and the cycle repeats.

Doing the same analysis for the other clusters (with fewer hosts), we
see gmetad close the connection properly and poll every 10-15 seconds.

We really don't know what might be causing this erratic behaviour. So
far we have tried, without luck, restarting the gmetad service,
restarting the gmetad server (host), and increasing the number of
threads (server_threads).

# /usr/sbin/gmetad -V
gmetad 3.6.0

# /usr/sbin/gmond -V
gmond 3.6.0

At first we thought it was an I/O issue, but after increasing our disk
IOPS we have realized it is not, and we are pretty sure it has to be a
software/configuration issue, because the other clusters are not
affected by the same problem.

Any help here would be much appreciated.

Thanks in advance.
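The poll described in steps 1-6 can be reproduced outside gmetad with a small client that does what gmetad should do: connect to gmond's TCP port, read until the peer's FIN, and close the socket immediately. This is a hedged diagnostic sketch (the function name and host placeholder are ours, not part of Ganglia); if this client gets the ~4 MB XML quickly and leaves no CLOSE_WAIT behind, the stall is on the gmetad side rather than in gmond or the network:

```python
import socket

def fetch_gmond_xml(host, port=8649, timeout=30.0):
    """Connect to a gmond TCP port, read the full XML dump,
    and close the socket as soon as the peer signals EOF."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # peer sent FIN: stop reading; the
                break     # `with` block closes our side promptly
            chunks.append(data)
    return b"".join(chunks)

if __name__ == "__main__":
    xml = fetch_gmond_xml("y.y.y.y")  # placeholder gmond address
    print(len(xml), "bytes received")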
--
Javier Villar
Site Reliability Engineer
CartoDB
Plaza de Callao 4, 2, 28013, Madrid, España
www.cartodb.com