Re: [Ganglia-general] missing data on large clusters
Ludmil, do you have multiple headnodes? Do they receive data from all the nodes? If yes, did you verify it (telnet to each headnode on port 8649 and count occurrences of the HOST... XML tag)?

b.

On 19 August 2015 at 12:01, Ludmil Stamboliyski l.stamboliy...@ucdn.com wrote:

Thank you Dave, I've done that, but to no avail. Then I did the following: ran a separate gmetad for this cluster - again to no avail. Then I thought, why not make gmetad poll data every second:

    data_source "example large cluster" 1 127.0.0.1:port

And it seems to be almost working now - I get data every two minutes. So clearly the bottleneck is between the gmond collector and gmetad - any ideas how to improve things there? Gmetad runs with rrds off; it sends data to a carbon server. It also uses memcached.

On Tue, Aug 18, 2015 at 6:04, David Chin david.c...@drexel.edu wrote:

Hi Ludmil: I had a similar problem a couple of years ago on a cluster with about 200 nodes. Currently, in a new place, I have about 120 nodes running Ganglia 3.6.1. The difference in the new cluster was changing globals { send_metadata_interval } from 0 to 120, which you already have. The following is the globals section on the aggregator gmond:

    globals {
      daemonize = yes
      setuid = yes
      user = nobody
      debug_level = 0
      max_udp_msg_len = 1472
      mute = no
      deaf = no
      allow_extra_data = yes
      host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
      host_tmax = 20 /* secs */
      cleanup_threshold = 300 /* secs */
      gexec = no
      # If you are not using multicast this value should be set to something other than 0.
      # Otherwise if you restart the aggregator gmond you will get empty graphs. 60 seconds is reasonable.
      send_metadata_interval = 60 /* secs */
    }

I also increased the UDP buffer size on the aggregator, to the value set in the kernel sysctl net.core.rmem_max:

    udp_recv_channel {
      ...
      buffer = 4194304
    }

On the gmetad, I use memcached. It only runs the default 4 threads.

Good luck, Dave

On Tue, Aug 18, 2015 at 7:25 AM, Ludmil Stamboliyski l.stamboliy...@ucdn.com wrote:

Hello, I am testing deploying Ganglia to monitor our servers. I have several clusters - most of them are small, but I do have two large ones with over 150 machines to monitor. The issue is that I do not receive all monitoring data from the machines in the large clusters: ganglia-web reports the clusters down, and in graphite and in the rrds I see very few data points for machines in these large clusters - so by my calculations 2/3 of the data is lost. I am using gmond in unicast mode. Here are examples of my configs.

Config on a monitored server:

    globals {
      daemonize = yes
      setuid = yes
      user = ganglia
      debug_level = 0
      max_udp_msg_len = 1472
      mute = no
      deaf = no
      host_dmax = 86400 /* secs */
      cleanup_threshold = 300 /* secs */
      gexec = no
      send_metadata_interval = 60
      override_hostname = !! HUMAN READABLE HOSTNAME !!
    }
    cluster {
      name = "Example large cluster"
      owner = unspecified
      latlong = unspecified
      url = unspecified
    }
    udp_send_channel {
      host = ip.addr.of.master
      port = 8654
      ttl = 1
    }
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }
    # Metric conf follows ...
Config of the gmond collector on the master node:

    globals {
      daemonize = yes
      setuid = yes
      user = ganglia
      debug_level = 0
      max_udp_msg_len = 1472
      mute = no
      deaf = no
      allow_extra_data = yes
      host_dmax = 86400 /* secs */
      cleanup_threshold = 300 /* secs */
      gexec = no
      send_metadata_interval = 120
    }
    cluster {
      name = "Example large cluster"
      owner = unspecified
      latlong = unspecified
      url = unspecified
    }
    udp_send_channel {
      host = localhost
      port = 8654
      ttl = 1
    }
    udp_recv_channel {
      port = 8654
    }
    tcp_accept_channel {
      port = 8654
    }

And here is an example of my gmetad.conf:

    data_source ...
    data_source "Example large cluster" localhost:8654
    data_source ...
    server_threads 16

In the logs I see a lot of "Error 1 sending the modular data data_source" - I searched various threads but did not find anything helpful. I checked the network settings and tuned UDP accordingly - the server does not drop packets; I also checked on the switch and there are no drops or losses. Load is rarely above 1.5, and this is a 16-core server with 128 GB of RAM. I ran the collector and gmetad in debug mode and everything seemed fine. I am really lost, so I'll be grateful for any help.

--
David Chin, Ph.D.  david.c...@drexel.edu
Sr. Systems Administrator, URCF, Drexel U.
http://www.drexel.edu/research/urcf/
https://linuxfollies.blogspot.com/
+1.215.221.4747 (mobile)
https://github.com/prehensilecode
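[Editorial sketch of Bostjan's verification step, using nc instead of interactive telnet so the XML can be piped; the headnode names are placeholders, and the port should match the aggregator's tcp_accept_channel - 8649 by default, 8654 in the configs above:]

    # Count how many hosts each aggregator gmond actually knows about
    # by dumping its XML and counting <HOST ...> tags:
    for h in headnode1 headnode2; do
        printf '%s: ' "$h"
        nc "$h" 8654 < /dev/null | grep -c '<HOST '
    done

If the count comes back well below the number of nodes in the cluster, packets are being lost before gmetad ever polls.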
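[And a sketch of the buffer tuning David describes, plus a matching check for UDP drops; the value is the example from his config, not a recommendation:]

    # Allow receive buffers as large as gmond requests (buffer = 4194304);
    # add the setting to /etc/sysctl.conf if it helps, so it persists:
    sysctl -w net.core.rmem_max=4194304

    # Watch the kernel's UDP counters while metrics flow; a climbing
    # "packet receive errors" count means the aggregator is still dropping:
    netstat -su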
Re: [Ganglia-general] missing data on large clusters
Hi Bostjan, and thank you for your time. My setup is: gmond daemons on each monitored machine, configured in unicast; 8 clusters; and one master node on which I have a gmond daemon for each cluster, each running on a different port. On the master node I have a gmetad daemon configured to send data to carbon-cache, with rrds off (actually I have a second gmetad for rrds, which is turned off while I am investigating this issue). All hosts are present in the XML and their Reported field changes on every run, so I think the gmond collector works correctly.

2015-08-19 14:48 GMT+03:00 Bostjan Skufca bost...@a2o.si:

Ludmil, do you have multiple headnodes? Do they receive data from all the nodes? If yes, did you verify it (telnet to each headnode on port 8649 and count occurrences of the HOST... XML tag)?

[...]
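[For concreteness, the layout described above would look roughly like this in gmetad.conf - a sketch only; the cluster names and ports are invented placeholders, one aggregator gmond per cluster:]

    # gmetad.conf on the master node: one data_source per cluster,
    # each pointing at that cluster's local aggregator gmond
    data_source "Example large cluster"  localhost:8654
    data_source "Small cluster A"        localhost:8655
    data_source "Small cluster B"        localhost:8656
    # ... one line per cluster, 8 in total ...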
Re: [Ganglia-general] missing data on large clusters
Does increasing gmetad's debug level (it then runs in the foreground) yield anything useful?

On 19 August 2015 at 21:15, Ludmil Stamboliyski l.stamboliy...@ucdn.com wrote:

[...]
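[For reference, a sketch of that debug run; the flags are standard gmetad options, the config path is an assumption:]

    # Any debug level > 0 keeps gmetad in the foreground, where it logs
    # its polling of each data_source and what it forwards:
    gmetad --conf=/etc/ganglia/gmetad.conf --debug=2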
Re: [Ganglia-general] missing data on large clusters
OK guys, thanks to your help we can consider this resolved. For anyone who wants to use graphite and carbon-cache, here is a piece of advice: run a separate gmetad daemon dedicated only to feeding carbon. The key is to set up carbon and gmetad to communicate over UDP - that gave me a threefold increase in received metrics. I am still digging into what the hell is wrong with the Ubuntu TCP stack, but it is working fine over UDP.

2015-08-19 23:34 GMT+03:00 Ludmil Stamboliyski l.stamboliy...@ucdn.com:

So... I found the culprit - it turns out that carbon-cache was slowing down the whole gmetad daemon... Now, with poll interval 1 and rrd, it finally became stable and began to load the machine as expected. The next thing to answer is: why is carbon so slow?

2015-08-19 22:59 GMT+03:00 Bostjan Skufca bost...@a2o.si:

[...]
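[A sketch of the UDP handoff described above, assuming a gmetad built with Graphite support and a stock carbon-cache; directive names follow the respective sample configs, and the port and prefix are placeholders:]

    # gmetad.conf (the carbon-feeding instance):
    carbon_server "127.0.0.1"
    carbon_port 2003
    carbon_protocol udp
    graphite_prefix "ganglia"

    # carbon.conf, [cache] section: enable the matching UDP listener:
    ENABLE_UDP_LISTENER = True
    UDP_RECEIVER_INTERFACE = 127.0.0.1
    UDP_RECEIVER_PORT = 2003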
[Ganglia-general] Metrics graphs not showing data
I've just set up a new installation of Ganglia 3.7.1 on a CentOS host. It is monitoring a Windows 2008 R2 HPC cluster (using the Microsoft HPC Ganglia 3.x add-on running on the head node). Data is coming through, and the cluster status graphs are being populated. After changing 'case_sensitive_hostnames' to false in conf_default.php, I also started getting node graphs displayed, so that part looks OK.

One problem though: on the main cluster status screen you can select individual metrics from a drop-down list, which should update the graphs for each compute node. I can see the graphs, but they are not populated at all - none of the metrics I select show any graph data. If I select something like 'cpu_speed' or 'ip_address', that information does show and is correct - it's just the graphs that don't. The graphs do change colour, though, based on the 1-minute load.

Any ideas? It would be great if I could get this working, as it's a very useful snapshot view.

Thanks
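[An aside on the hostname fix above: the usual place for local ganglia-web settings is conf.php, which is loaded after conf_default.php and so survives upgrades - a minimal sketch, assuming a stock ganglia-web install:]

    <?php
    // conf.php - local overrides, loaded after conf_default.php
    $conf['case_sensitive_hostnames'] = false;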
Re: [Ganglia-general] missing data on large clusters
Thank you Dave, I've done that, but to no avail. Then I did the following: ran a separate gmetad for this cluster - again to no avail. Then I thought, why not make gmetad poll data every second:

    data_source "example large cluster" 1 127.0.0.1:port

And it seems to be almost working now - I get data every two minutes. So clearly the bottleneck is between the gmond collector and gmetad - any ideas how to improve things there? Gmetad runs with rrds off; it sends data to a carbon server. It also uses memcached.

On Tue, Aug 18, 2015 at 6:04, David Chin david.c...@drexel.edu wrote:

[...]
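[For anyone copying the line above: the optional second field of a data_source line is the polling interval in seconds, and gmetad defaults to 15 when it is omitted - a sketch with placeholder host:port values, 8654 assumed from the collector configs earlier in the thread:]

    # gmetad.conf syntax: data_source "name" [polling_interval] host:port [...]
    data_source "example large cluster" 1 127.0.0.1:8654    # poll every second
    data_source "some small cluster" 127.0.0.1:8655         # interval omitted: 15s default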
--
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general