Re: [Ganglia-general] missing data on large clusters
Ludmil, do you have multiple headnodes? Do they receive data from all the nodes? If yes, did you verify it (telnet to each headnode on port 8649 and count occurrences of the HOST XML tags)?

b.
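That headnode check can be scripted. A minimal sketch, assuming netcat (`nc`) is installed; `headnode1` and `headnode2` are placeholder hostnames, and 8649 is the tcp_accept_channel port from the configs in this thread:

```shell
# Pull the XML dump from each aggregator gmond and count HOST elements.
# grep -o counts occurrences rather than lines, in case the XML is not
# pretty-printed one element per line.
for head in headnode1 headnode2; do
  count=$(nc "$head" 8649 </dev/null | grep -o '<HOST ' | wc -l)
  echo "$head reports $count hosts"
done
```

If the counts differ between headnodes, or are lower than the number of monitored nodes, data is being lost before it reaches the aggregator.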
Re: [Ganglia-general] missing data on large clusters
Hi Bostjan, and thank you for your time. My setup is: gmond daemons on each monitored machine, configured in unicast; 8 clusters; and one master node on which I run a gmond daemon for each cluster, each on a different port. On the master node I have a gmetad daemon configured to send data to carbon-cache, with rrds off (actually I have a second gmetad for rrds, which is turned off while I am investigating this issue). All hosts are present in the XML and their Reported field changes on every run, so I think the gmond collector works correctly.
Re: [Ganglia-general] missing data on large clusters
Does increasing gmetad's debug level (it then runs in the foreground) yield anything useful?
Re: [Ganglia-general] missing data on large clusters
Ok guys, thanks to your help we can count this resolved. For anyone who wants to use graphite and carbon-cache, here is a piece of advice: run a separate gmetad daemon dedicated only to feeding carbon. The key is to set up carbon and gmetad to communicate over UDP - that gave me a threefold increase in received metrics. I am still digging into what the hell is wrong with the Ubuntu TCP stack, but alas, it works fine over UDP.

2015-08-19 23:34 GMT+03:00 Ludmil Stamboliyski l.stamboliy...@ucdn.com:

So... I found the culprit - it turns out that carbon-cache was slowing down the whole gmetad daemon... Now, with a poll interval of 1 and rrds on, it finally became stable and began to load the machine as expected. The next thing to answer is: why is carbon so slow?
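For reference, a dedicated carbon-feeding gmetad along those lines might look roughly like the fragment below. The host, port, and prefix are placeholders, and directive availability varies between Ganglia versions, so treat this as a sketch rather than a drop-in config:

```
# Hypothetical gmetad.conf fragment for a carbon-only gmetad instance
carbon_server "carbon.example.com"   # placeholder host
carbon_port 2003                     # placeholder port
carbon_protocol udp                  # the change that mattered in this thread
graphite_prefix "ganglia"            # placeholder metric prefix
# Disable RRD writing for this instance (a second gmetad handles rrds)
write_rrds off
```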
Re: [Ganglia-general] missing data on large clusters
Thank you Dave, I've done that, but to no avail. Then I did the following: ran a separate gmetad for this cluster - again to no avail. Then I thought, why not make gmetad pull data every second:

    data_source "example large cluster" 1 127.0.0.1:port

And it seems to be almost working now - I get data every two minutes. So clearly the bottleneck is between the gmond collector and gmetad - any ideas how to improve things there? Gmetad runs with rrds off; it sends data to a carbon server. It also uses memcached.
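For anyone following along, a gmetad data_source line takes the cluster name (quoted if it contains spaces), an optional polling interval in seconds (the default is 15), and then one or more host:port addresses. Using the collector port from this thread, the one-second polling above would look like:

```
# Poll the collector gmond every second instead of the default 15 s.
# 8654 matches the collector's tcp_accept_channel in this thread.
data_source "example large cluster" 1 127.0.0.1:8654
```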
[Ganglia-general] missing data on large clusters
Hello, I am testing deploying Ganglia to monitor our servers. I have several clusters - most of them are small, but I do have two large ones, with over 150 machines to monitor. The issue is that I do not receive all monitoring data from the machines in the large clusters - ganglia-web reports the clusters as down, and in graphite and in the rrds I see very few data points for machines in these large clusters - so by my calculations 2/3 of the data is lost. I am using gmond in unicast mode. Here are examples of my configs.

Example config on a monitored server:

    globals {
      daemonize = yes
      setuid = yes
      user = ganglia
      debug_level = 0
      max_udp_msg_len = 1472
      mute = no
      deaf = no
      host_dmax = 86400 /* secs */
      cleanup_threshold = 300 /* secs */
      gexec = no
      send_metadata_interval = 60
      override_hostname = !! HUMAN READABLE HOSTNAME !!
    }
    cluster {
      name = "Example large cluster"
      owner = "unspecified"
      latlong = "unspecified"
      url = "unspecified"
    }
    udp_send_channel {
      host = ip.addr.of.master
      port = 8654
      ttl = 1
    }
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }
    # Metric conf follows ...

Example config of the gmond collector on the master node:

    globals {
      daemonize = yes
      setuid = yes
      user = ganglia
      debug_level = 0
      max_udp_msg_len = 1472
      mute = no
      deaf = no
      allow_extra_data = yes
      host_dmax = 86400 /* secs */
      cleanup_threshold = 300 /* secs */
      gexec = no
      send_metadata_interval = 120
    }
    cluster {
      name = "Example large cluster"
      owner = "unspecified"
      latlong = "unspecified"
      url = "unspecified"
    }
    udp_send_channel {
      host = localhost
      port = 8654
      ttl = 1
    }
    udp_recv_channel {
      port = 8654
    }
    tcp_accept_channel {
      port = 8654
    }

And here is an example of my gmetad.conf:

    data_source ...
    data_source "Example large cluster" localhost:8654
    data_source ...
    server_threads 16

In the logs I see a lot of "Error 1 sending the modular data data_source" - I searched various threads but did not find anything helpful. I checked the network settings and tuned UDP accordingly - the server does not drop packets; I also checked on the switch - there are no drops or losses. Load is rarely above 1.5, and this is a 16-core server with 128 GB of RAM. I ran the collector and gmetad in debug mode and it seemed fine. I am really lost, so I'll be grateful for any help.

--
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
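As a supplement to the network checks described above, one can confirm whether the kernel itself is discarding UDP datagrams by reading the UDP counters in /proc/net/snmp. This is a Linux-specific sketch, not Ganglia tooling, and the RcvbufErrors column only exists on reasonably modern kernels:

```shell
# Print UDP error counters; a growing RcvbufErrors value means the
# receiving daemon is not draining its socket buffer fast enough.
# The first Udp: line names the columns; the second holds the values.
awk '/^Udp:/ {
  if (!header_seen) { for (i = 1; i <= NF; i++) col[$i] = i; header_seen = 1 }
  else print "InErrors:", $col["InErrors"], "RcvbufErrors:", $col["RcvbufErrors"]
}' /proc/net/snmp
```

Switch counters can be clean while these counters still climb, because RcvbufErrors drops happen inside the host after the packet has arrived.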
Re: [Ganglia-general] missing data on large clusters
Hi Ludmil: I had a similar problem a couple of years ago on a cluster with about 200 nodes. Currently, in a new place, I have about 120 nodes running Ganglia 3.6.1. The difference in the new cluster was changing globals { send_metadata_interval } from 0 to 120, which you already have. The following is the globals section on the aggregator gmond:

    globals {
      daemonize = yes
      setuid = yes
      user = nobody
      debug_level = 0
      max_udp_msg_len = 1472
      mute = no
      deaf = no
      allow_extra_data = yes
      host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
      host_tmax = 20 /* secs */
      cleanup_threshold = 300 /* secs */
      gexec = no
      # If you are not using multicast this value should be set to something other than 0.
      # Otherwise if you restart the aggregator gmond you will get empty graphs. 60 seconds is reasonable.
      send_metadata_interval = 60 /* secs */
    }

I also increased the UDP buffer size on the aggregator, to the value set in the kernel sysctl net.core.rmem_max:

    udp_recv_channel {
      ...
      buffer = 4194304
    }

On the gmetad side, I use memcached. It only runs the default 4 threads.

Good luck,
Dave

--
David Chin, Ph.D. david.c...@drexel.edu
Sr. Systems Administrator, URCF, Drexel U.
http://www.drexel.edu/research/urcf/ https://linuxfollies.blogspot.com/
+1.215.221.4747 (mobile) https://github.com/prehensilecode
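To check whether that buffer request can actually take effect, compare it against the kernel ceiling. A Linux-only sketch; the 4194304 value simply mirrors the example above:

```shell
# udp_recv_channel { buffer } is capped by the kernel's rmem_max ceiling,
# so a large buffer setting is silently limited if the sysctl is lower.
rmem_max=$(cat /proc/sys/net/core/rmem_max)
echo "net.core.rmem_max = $rmem_max"
if [ "$rmem_max" -lt 4194304 ]; then
  echo "ceiling is below 4194304; raise it (as root): sysctl -w net.core.rmem_max=4194304"
fi
```

Persist the setting in /etc/sysctl.conf (or /etc/sysctl.d/) so it survives reboots.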