So... I got the culprit - it turns out that carbon-cache is slowing down the whole gmetad daemon... Now, with a poll interval of 1 and rrds, it finally became stable and began to load the machine as expected. The next thing to answer is: why is carbon so slow?
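For anyone following along, the gmetad.conf knobs in play here look roughly like this - a sketch assuming the Graphite export support in Ganglia 3.3+; the hosts, ports, and values are placeholders, not my real config:

    # Poll the collector gmond every second (the interval change mentioned above)
    data_source "example large cluster" 1 127.0.0.1:8654

    # Graphite/carbon export - if carbon-cache is slow to accept metrics,
    # this send path can stall gmetad's whole polling loop
    carbon_server "127.0.0.1"
    carbon_port 2003
    # milliseconds gmetad waits on carbon before giving up
    carbon_timeout 500
    graphite_prefix "ganglia"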
2015-08-19 22:59 GMT+03:00 Bostjan Skufca <bost...@a2o.si>:
> Does increasing gmetad's debug level (it runs in the foreground) yield
> anything useful?
>
>
> On 19 August 2015 at 21:15, Ludmil Stamboliyski <l.stamboliy...@ucdn.com> wrote:
> > Hi Bostjan, and thank you for your time.
> >
> > My setup is: gmond daemons on each monitored machine, configured in
> > unicast, 8 clusters, and one master node on which I have a gmond daemon
> > for each cluster running on a different port. On the master node I have
> > gmetad configured to send data to carbon-cache with rrds off (actually I
> > have a second gmetad for rrds, which is turned off while I am
> > investigating this issue). All hosts are present in the xml and their
> > "Reported" field changes on every run, so I think the gmond collector
> > works correctly.
> >
> > 2015-08-19 14:48 GMT+03:00 Bostjan Skufca <bost...@a2o.si>:
> >>
> >> Ludmil,
> >>
> >> do you have multiple headnodes? Do they receive data from all the
> >> nodes? If yes, did you verify it (telnet to each headnode on port 8649
> >> and count occurrences of the <HOST...> xml tag)?
> >>
> >> b.
> >>
> >>
> >> On 19 August 2015 at 12:01, Ludmil Stamboliyski <l.stamboliy...@ucdn.com> wrote:
> >> > Thank you Dave,
> >> >
> >> > I've done that, but to no avail. Then I did the following - ran a
> >> > separate gmetad for this cluster - again to no avail. Then I thought:
> >> > why not make gmetad poll data every second:
> >> >
> >> > data_source "example large cluster" 1 127.0.0.1:port
> >> >
> >> > And it seems to be almost working now - I get data every two minutes.
> >> > So clearly the bottleneck is between the gmond collector and gmetad -
> >> > any ideas how to improve things there?
> >> > Gmetad runs with rrds off and sends data to a carbon server. It also
> >> > uses memcached.
> >> >
> >> >
> >> > On Tue, Aug 18, 2015 at 6:04, David Chin <david.c...@drexel.edu> wrote:
> >> >
> >> > Hi Ludmil:
> >> >
> >> > I had a similar problem a couple of years ago on a cluster with about
> >> > 200 nodes.
> >> >
> >> > Currently, in a new place, I have about 120 nodes running Ganglia
> >> > 3.6.1. The difference in the new cluster was changing "globals {
> >> > send_metadata_interval }" from 0 to 120, which you already have. The
> >> > following is the globals section on the aggregator gmond:
> >> >
> >> > globals {
> >> >   daemonize = yes
> >> >   setuid = yes
> >> >   user = nobody
> >> >   debug_level = 0
> >> >   max_udp_msg_len = 1472
> >> >   mute = no
> >> >   deaf = no
> >> >   allow_extra_data = yes
> >> >   host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
> >> >   host_tmax = 20 /* secs */
> >> >   cleanup_threshold = 300 /* secs */
> >> >   gexec = no
> >> >   # If you are not using multicast, this value should be set to
> >> >   # something other than 0. Otherwise, if you restart the aggregator
> >> >   # gmond you will get empty graphs. 60 seconds is reasonable.
> >> >   send_metadata_interval = 60 /* secs */
> >> > }
> >> >
> >> > I also increased the UDP buffer size on the aggregator, to the value
> >> > set in the kernel ("sysctl net.core.rmem_max"):
> >> >
> >> > udp_recv_channel { ... buffer = 4194304 }
> >> >
> >> > On the gmetad, I use memcached. It only runs the default 4 threads.
> >> >
> >> > Good luck,
> >> > Dave
> >> >
> >> > On Tue, Aug 18, 2015 at 7:25 AM, Ludmil Stamboliyski
> >> > <l.stamboliy...@ucdn.com> wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> I am testing deploying ganglia to monitor our servers. I have several
> >> >> clusters - most of them are small, but I do have two large ones, with
> >> >> over 150 machines each to monitor. The issue is that I do not receive
> >> >> all the monitoring data from the machines in the large clusters -
> >> >> ganglia-web reports the clusters as down, and in graphite and in the
> >> >> rrds I see very few data points for machines in these large clusters
> >> >> - so by my calculations 2/3 of the data is lost. I am using gmond in
> >> >> unicast mode. Here are examples of my configs.
> >> >>
> >> >> Example config on a monitored server:
> >> >>
> >> >> globals {
> >> >>   daemonize = yes
> >> >>   setuid = yes
> >> >>   user = ganglia
> >> >>   debug_level = 0
> >> >>   max_udp_msg_len = 1472
> >> >>   mute = no
> >> >>   deaf = no
> >> >>   host_dmax = 86400 /* secs */
> >> >>   cleanup_threshold = 300 /* secs */
> >> >>   gexec = no
> >> >>   send_metadata_interval = 60
> >> >>   override_hostname = "<<!! HUMAN READABLE HOSTNAME !!>>"
> >> >> }
> >> >> cluster {
> >> >>   name = "Example large cluster"
> >> >>   owner = "unspecified"
> >> >>   latlong = "unspecified"
> >> >>   url = "unspecified"
> >> >> }
> >> >> udp_send_channel {
> >> >>   host = ip.addr.of.master
> >> >>   port = 8654
> >> >>   ttl = 1
> >> >> }
> >> >> udp_recv_channel {
> >> >>   port = 8649
> >> >> }
> >> >> tcp_accept_channel {
> >> >>   port = 8649
> >> >> }
> >> >> # Metric conf follows ...
> >> >>
> >> >> Example config of the gmond collector on the master node:
> >> >>
> >> >> globals {
> >> >>   daemonize = yes
> >> >>   setuid = yes
> >> >>   user = ganglia
> >> >>   debug_level = 0
> >> >>   max_udp_msg_len = 1472
> >> >>   mute = no
> >> >>   deaf = no
> >> >>   allow_extra_data = yes
> >> >>   host_dmax = 86400 /* secs */
> >> >>   cleanup_threshold = 300 /* secs */
> >> >>   gexec = no
> >> >>   send_metadata_interval = 120
> >> >> }
> >> >> cluster {
> >> >>   name = "Example large cluster"
> >> >>   owner = "unspecified"
> >> >>   latlong = "unspecified"
> >> >>   url = "unspecified"
> >> >> }
> >> >> udp_send_channel {
> >> >>   host = localhost
> >> >>   port = 8654
> >> >>   ttl = 1
> >> >> }
> >> >> udp_recv_channel {
> >> >>   port = 8654
> >> >> }
> >> >> tcp_accept_channel {
> >> >>   port = 8654
> >> >> }
> >> >>
> >> >> And here is an example of my gmetad.conf:
> >> >>
> >> >> data_source ...
> >> >> data_source "Example large cluster" localhost:8654
> >> >> data_source ...
> >> >>
> >> >> server_threads 16
> >> >>
> >> >> In the logs I see a lot of "Error 1 sending the modular data
> >> >> data_source" - I searched various threads but did not find anything
> >> >> helpful. I checked the network settings and tuned udp accordingly -
> >> >> the server does not drop packets; I also checked on the switch -
> >> >> there are no drops or losses. Load is rarely above 1.5, and this is a
> >> >> 16-core server with 128GB of ram. I ran the collector and gmetad in
> >> >> debug mode and they seemed fine.
> >> >>
> >> >> I am really lost, so I'll be grateful for any help.
> >> >
> >> >
> >> > --
> >> > David Chin, Ph.D.
> >> > david.c...@drexel.edu    Sr. Systems Administrator, URCF, Drexel U.
> >> > http://www.drexel.edu/research/urcf/
> >> > https://linuxfollies.blogspot.com/
> >> > +1.215.221.4747 (mobile)
> >> > https://github.com/prehensilecode
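For reference, the headnode check Bostjan suggests above fits in a one-liner - a sketch where "headnode" stands for each collector gmond's address and 8649 for its tcp_accept_channel port:

    # Dump the collector's XML and count the hosts it currently reports
    nc headnode 8649 | grep -c '<HOST '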
------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general