Hi Bostjan, and thank you for your time.

My setup is: gmond daemons on each monitored machine, configured for unicast; 8 clusters; and one master node on which I run a collector gmond for each cluster, each on a different port. On the master node, gmetad is configured to send data to carbon-cache with rrds off (I actually have a second gmetad for the RRDs, which is turned off while I investigate this issue). All hosts are present in the XML, and their "REPORTED" field changes on every run, so I think the gmond collectors work correctly.
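For reference, the check Bostjan suggested (connect to each collector and count `<HOST ...>` tags) can be scripted. This is a sketch using a canned sample in place of a live collector; the hostnames, values, and port are illustrative only:

```shell
# Sample of the XML a collector gmond serves on its tcp_accept_channel
# (made-up hosts; against a live collector you would instead run:
#   nc master-node 8654 | grep -c '<HOST ')
xml='<GANGLIA_XML>
<CLUSTER NAME="Example large cluster">
<HOST NAME="node01" REPORTED="1439990000" TN="12">
</HOST>
<HOST NAME="node02" REPORTED="1439990005" TN="7">
</HOST>
</CLUSTER>
</GANGLIA_XML>'

# Count reporting hosts; this should equal the cluster's node count:
printf '%s\n' "$xml" | grep -c '<HOST '
```

If the count matches the number of nodes in the cluster, the loss is downstream of the collector (collector-to-gmetad, or gmetad-to-carbon), not on the node-to-collector leg.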
2015-08-19 14:48 GMT+03:00 Bostjan Skufca <bost...@a2o.si>:
> Ludmil,
>
> do you have multiple headnodes? Do they receive data from all the
> nodes? If yes, did you verify it (telnet to each headnode on port 8649
> and count occurrences of the <HOST ...> XML tag)?
>
> b.
>
>
> On 19 August 2015 at 12:01, Ludmil Stamboliyski <l.stamboliy...@ucdn.com>
> wrote:
> > Thank you, Dave,
> >
> > I've done that, but to no avail. I then ran a separate gmetad for this
> > cluster - again to no avail. Then I thought: why not make gmetad pull
> > data every second:
> >
> > data_source "example large cluster" 1 127.0.0.1:port
> >
> > And it seems to be almost working now - I got data every two minutes. So
> > clearly the bottleneck is between the gmond collector and gmetad - any
> > ideas how to improve things there?
> > gmetad runs with rrds off and sends data to the carbon server. It also
> > has memcached.
> >
> >
> > On Tue, Aug 18, 2015 at 6:04, David Chin <david.c...@drexel.edu> wrote:
> >
> > Hi Ludmil:
> >
> > I had a similar problem a couple of years ago on a cluster with about 200
> > nodes.
> >
> > Currently, in a new place, I have about 120 nodes running Ganglia 3.6.1.
> > The difference in the new cluster was changing "globals {
> > send_metadata_interval }" from 0 to 120, which you already have. The
> > following is the globals section on the aggregator gmond:
> >
> > globals {
> >   daemonize = yes
> >   setuid = yes
> >   user = nobody
> >   debug_level = 0
> >   max_udp_msg_len = 1472
> >   mute = no
> >   deaf = no
> >   allow_extra_data = yes
> >   host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
> >   host_tmax = 20 /* secs */
> >   cleanup_threshold = 300 /* secs */
> >   gexec = no
> >   # If you are not using multicast this value should be set to something
> >   # other than 0. Otherwise if you restart the aggregator gmond you will
> >   # get empty graphs.
> >   # 60 seconds is reasonable.
> >   send_metadata_interval = 60 /* secs */
> > }
> >
> > I also increased the UDP buffer size on the aggregator, to the value set
> > in the kernel ("sysctl net.core.rmem_max"):
> >
> > udp_recv_channel { ... buffer = 4194304 }
> >
> > On the gmetad, I use memcached. It only runs the default 4 threads.
> >
> > Good luck,
> > Dave
> >
> > On Tue, Aug 18, 2015 at 7:25 AM, Ludmil Stamboliyski
> > <l.stamboliy...@ucdn.com> wrote:
> >>
> >> Hello,
> >>
> >> I am testing deploying Ganglia to monitor our servers. I have several
> >> clusters - most of them are small, but two are large, with over 150
> >> machines to monitor. The issue is that I do not receive all the
> >> monitoring data from the machines in the large clusters - ganglia-web
> >> reports the clusters as down, and in Graphite and in the RRDs I see very
> >> few data points for machines in these large clusters - so by my
> >> calculations 2/3 of the data is lost. I am using gmond in unicast mode.
> >> Here are examples of my configs.
> >>
> >> Example config on a monitored server:
> >>
> >> globals {
> >>   daemonize = yes
> >>   setuid = yes
> >>   user = ganglia
> >>   debug_level = 0
> >>   max_udp_msg_len = 1472
> >>   mute = no
> >>   deaf = no
> >>   host_dmax = 86400 /* secs */
> >>   cleanup_threshold = 300 /* secs */
> >>   gexec = no
> >>   send_metadata_interval = 60
> >>   override_hostname = "<<!! HUMAN READABLE HOSTNAME !!>>"
> >> }
> >> cluster {
> >>   name = "Example large cluster"
> >>   owner = "unspecified"
> >>   latlong = "unspecified"
> >>   url = "unspecified"
> >> }
> >> udp_send_channel {
> >>   host = ip.addr.of.master
> >>   port = 8654
> >>   ttl = 1
> >> }
> >> udp_recv_channel {
> >>   port = 8649
> >> }
> >> tcp_accept_channel {
> >>   port = 8649
> >> }
> >> # Metric conf follows ...
> >>
> >> Example config of the gmond collector on the master node:
> >>
> >> globals {
> >>   daemonize = yes
> >>   setuid = yes
> >>   user = ganglia
> >>   debug_level = 0
> >>   max_udp_msg_len = 1472
> >>   mute = no
> >>   deaf = no
> >>   allow_extra_data = yes
> >>   host_dmax = 86400 /* secs */
> >>   cleanup_threshold = 300 /* secs */
> >>   gexec = no
> >>   send_metadata_interval = 120
> >> }
> >> cluster {
> >>   name = "Example large cluster"
> >>   owner = "unspecified"
> >>   latlong = "unspecified"
> >>   url = "unspecified"
> >> }
> >> udp_send_channel {
> >>   host = localhost
> >>   port = 8654
> >>   ttl = 1
> >> }
> >> udp_recv_channel {
> >>   port = 8654
> >> }
> >> tcp_accept_channel {
> >>   port = 8654
> >> }
> >>
> >> And here is an example from my gmetad.conf:
> >>
> >> data_source ...
> >> data_source "Example large cluster" localhost:8654
> >> data_source ...
> >>
> >> server_threads 16
> >>
> >> In the logs I see a lot of "Error 1 sending the modular data data_source" -
> >> I searched various threads but did not find anything helpful.
> >> I checked the network settings and tuned UDP accordingly - the server
> >> does not drop packets; I also checked on the switch - there are no drops
> >> or losses. Load is rarely above 1.5, and this is a 16-core server with
> >> 128 GB of RAM. I ran the collector and gmetad in debug mode and they
> >> seemed fine.
> >>
> >> I am really lost, so I'll be grateful for any help.
> >
> > --
> > David Chin, Ph.D.
> > david.c...@drexel.edu    Sr. Systems Administrator, URCF, Drexel U.
> > http://www.drexel.edu/research/urcf/
> > https://linuxfollies.blogspot.com/
> > +1.215.221.4747 (mobile)
> > https://github.com/prehensilecode
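The kernel side of Dave's UDP buffer advice above can be checked from the shell. A minimal sketch, assuming a Linux master node; 4194304 is the value from Dave's example, not a recommendation:

```shell
# Ceiling for socket receive buffers (equivalently: sysctl net.core.rmem_max);
# the udp_recv_channel "buffer" setting cannot exceed this:
cat /proc/sys/net/core/rmem_max

# Raise it if it is below the buffer you want gmond to use, e.g.:
#   sysctl -w net.core.rmem_max=4194304

# Kernel UDP counters; a growing RcvbufErrors value means packets are
# being dropped because the socket buffer overflows between reads:
grep '^Udp:' /proc/net/snmp
```

On the gmetad side, the polling interval is the second (optional) field of `data_source`, in seconds, defaulting to 15 - that is the knob Ludmil's `data_source "example large cluster" 1 ...` workaround lowered.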
------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general