Does increasing gmetad's debug level (runs in foreground) yield anything useful?
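For the headnode check suggested further down the thread (connect to each headnode on port 8649 and count `<HOST...>` tags), a pipeline like the sketch below might help. The headnode host/port are placeholders, and the XML here is a stubbed-in sample so the counting step itself is visible; on a live system you would replace the `printf` with `nc your-headnode 8649` (or the telnet equivalent).

```shell
# In practice, replace the printf with:  nc your-headnode 8649
# The XML below is a stand-in sample, not real gmond output.
xml='<CLUSTER NAME="demo"><HOST NAME="a" REPORTED="1"/><HOST NAME="b" REPORTED="2"/></CLUSTER>'
printf '%s\n' "$xml" | grep -o '<HOST ' | wc -l   # prints 2: hosts this headnode knows about
```

Comparing that count per headnode against the number of nodes that should be reporting narrows down whether the loss is before or after the collector.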
On 19 August 2015 at 21:15, Ludmil Stamboliyski <l.stamboliy...@ucdn.com> wrote:
> Hi Bostjan and thank you for your time,
>
> My setup is:
> gmond daemons on each monitored machine, configured in unicast, 8 clusters,
> and one master node running a gmond daemon for each cluster on a different
> port. On the master node I have a gmetad daemon configured to send data to
> carbon-cache, with rrds off (I actually have a second gmetad for rrds,
> which is turned off while I investigate this issue). All hosts are present
> in the XML and their "Reported" field changes on every run, so I think the
> gmond collector works correctly.
>
> 2015-08-19 14:48 GMT+03:00 Bostjan Skufca <bost...@a2o.si>:
>>
>> Ludmil,
>>
>> do you have multiple headnodes? Do they receive data from all the
>> nodes? If yes, did you verify it (telnet to each headnode on port 8649
>> and count occurrences of the <HOST...> XML tag)?
>>
>> b.
>>
>>
>> On 19 August 2015 at 12:01, Ludmil Stamboliyski <l.stamboliy...@ucdn.com>
>> wrote:
>> > Thank you Dave,
>> >
>> > I've done that, but to no avail. Then I ran a separate gmetad for this
>> > cluster - again to no avail. Then I thought, why not make gmetad pull
>> > data every second:
>> >
>> > data_source "example large cluster" 1 127.0.0.1:port
>> >
>> > And it almost works now - I get data every two minutes. So clearly the
>> > bottleneck is between the gmond collector and gmetad - any ideas how to
>> > improve things there?
>> > Gmetad runs with rrds off and sends data to the carbon server. It also
>> > uses memcached.
>> >
>> >
>> > On Tue, Aug 18, 2015 at 6:04, David Chin <david.c...@drexel.edu> wrote:
>> >
>> > Hi Ludmil:
>> >
>> > I had a similar problem a couple of years ago on a cluster with about
>> > 200 nodes.
>> >
>> > Currently, in a new place, I have about 120 nodes running Ganglia 3.6.1.
>> > The difference in the new cluster was changing "globals {
>> > send_metadata_interval }" from 0 to 120, which you already have. The
>> > following is the globals section on the aggregator gmond:
>> >
>> > globals {
>> >   daemonize = yes
>> >   setuid = yes
>> >   user = nobody
>> >   debug_level = 0
>> >   max_udp_msg_len = 1472
>> >   mute = no
>> >   deaf = no
>> >   allow_extra_data = yes
>> >   host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
>> >   host_tmax = 20 /* secs */
>> >   cleanup_threshold = 300 /* secs */
>> >   gexec = no
>> >   # If you are not using multicast this value should be set to something
>> >   # other than 0. Otherwise if you restart the aggregator gmond you will
>> >   # get empty graphs. 60 seconds is reasonable.
>> >   send_metadata_interval = 60 /* secs */
>> > }
>> >
>> > I also increased the UDP buffer size on the aggregator, to the value set
>> > in the kernel ("sysctl net.core.rmem_max"):
>> >
>> > udp_recv_channel { ... buffer = 4194304 }
>> >
>> > On the gmetad host, I use memcached. It only runs the default 4 threads.
>> >
>> > Good luck,
>> > Dave
>> >
>> > On Tue, Aug 18, 2015 at 7:25 AM, Ludmil Stamboliyski
>> > <l.stamboliy...@ucdn.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I am testing deploying Ganglia to monitor our servers. I have several
>> >> clusters - most of them are small, but two are large, each with over
>> >> 150 machines to monitor. The issue is that I do not receive all the
>> >> monitoring data from the machines in the large clusters - ganglia-web
>> >> reports these clusters as down, and in graphite and in the rrds I see
>> >> very few data points for machines in these large clusters - by my
>> >> calculations 2/3 of the data is lost. I am using gmond in unicast mode.
>> >> Here are examples of my configs:
>> >>
>> >> Example config on a monitored server:
>> >>
>> >> globals {
>> >>   daemonize = yes
>> >>   setuid = yes
>> >>   user = ganglia
>> >>   debug_level = 0
>> >>   max_udp_msg_len = 1472
>> >>   mute = no
>> >>   deaf = no
>> >>   host_dmax = 86400 /* secs */
>> >>   cleanup_threshold = 300 /* secs */
>> >>   gexec = no
>> >>   send_metadata_interval = 60
>> >>   override_hostname = "<<!! HUMAN READABLE HOSTNAME !!>>"
>> >> }
>> >> cluster {
>> >>   name = "Example large cluster"
>> >>   owner = "unspecified"
>> >>   latlong = "unspecified"
>> >>   url = "unspecified"
>> >> }
>> >> udp_send_channel {
>> >>   host = ip.addr.of.master
>> >>   port = 8654
>> >>   ttl = 1
>> >> }
>> >> udp_recv_channel {
>> >>   port = 8649
>> >> }
>> >> tcp_accept_channel {
>> >>   port = 8649
>> >> }
>> >> # Metric conf follows ...
>> >>
>> >> Example config of the gmond collector on the master node:
>> >>
>> >> globals {
>> >>   daemonize = yes
>> >>   setuid = yes
>> >>   user = ganglia
>> >>   debug_level = 0
>> >>   max_udp_msg_len = 1472
>> >>   mute = no
>> >>   deaf = no
>> >>   allow_extra_data = yes
>> >>   host_dmax = 86400 /* secs */
>> >>   cleanup_threshold = 300 /* secs */
>> >>   gexec = no
>> >>   send_metadata_interval = 120
>> >> }
>> >> cluster {
>> >>   name = "Example large cluster"
>> >>   owner = "unspecified"
>> >>   latlong = "unspecified"
>> >>   url = "unspecified"
>> >> }
>> >> udp_send_channel {
>> >>   host = localhost
>> >>   port = 8654
>> >>   ttl = 1
>> >> }
>> >> udp_recv_channel {
>> >>   port = 8654
>> >> }
>> >> tcp_accept_channel {
>> >>   port = 8654
>> >> }
>> >>
>> >> And here is an example of my gmetad.conf:
>> >>
>> >> data_source ...
>> >> data_source "Example large cluster" localhost:8654
>> >> data_source ...
>> >>
>> >> server_threads 16
>> >>
>> >> In the logs I see a lot of "Error 1 sending the modular data data_source" -
>> >> I searched various threads but did not find anything helpful.
>> >> I checked the network settings and tuned UDP accordingly - the server
>> >> does not drop packets; I also checked on the switch - there are no
>> >> drops or losses. Load is rarely above 1.5, and this is a 16-core server
>> >> with 128GB of RAM. I ran the collector and gmetad in debug mode and it
>> >> seemed fine.
>> >>
>> >> I am really lost, so I'll be grateful for any help.
>> >>
>> >> ------------------------------------------------------------------------------
>> >> _______________________________________________
>> >> Ganglia-general mailing list
>> >> Ganglia-general@lists.sourceforge.net
>> >> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>> >
>> >
>> > --
>> > David Chin, Ph.D.
>> > david.c...@drexel.edu    Sr. Systems Administrator, URCF, Drexel U.
>> > http://www.drexel.edu/research/urcf/
>> > https://linuxfollies.blogspot.com/
>> > +1.215.221.4747 (mobile)
>> > https://github.com/prehensilecode
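Since "the server does not drop packets" usually refers to NIC/switch counters, it is worth also checking the kernel's per-socket UDP counters: with one collector gmond receiving from 150+ senders, datagrams can be discarded because the socket receive buffer overflows, which shows up in RcvbufErrors rather than on the interface. A sketch of pulling that counter out of /proc/net/snmp, with a canned sample so the parsing is visible; on a live Linux host you would run the awk against the real file, and the field order should be checked against the header line on your kernel, since it can differ between versions:

```shell
# On a live host:  awk '...' /proc/net/snmp   (or just: netstat -su)
# RcvbufErrors counts datagrams dropped because the socket receive buffer was full.
# The sample below is a stand-in for the real file contents.
snmp='Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 123456 5 2 98765 7 0'
printf '%s\n' "$snmp" | awk '$1 == "Udp:" && $2 ~ /^[0-9]+$/ {print "RcvbufErrors:", $6}'
# prints: RcvbufErrors: 7
```

If that counter grows, raising `buffer` in `udp_recv_channel` (and `net.core.rmem_max`, as Dave suggested above) is the usual fix.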