So... I got the culprit - it turns out that carbon-cache is slowing down the whole gmetad daemon... Now, with a poll interval of 1 and rrds, it finally became stable and began to load the machine as expected. The next thing to answer is: why is carbon so slow?
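For anyone following along, the gmetad.conf knobs in play here look roughly like this - a sketch assuming the Graphite export support in Ganglia 3.3+; the hosts, ports, and values are placeholders, not my real config:

    # Poll the collector gmond every second (the interval change mentioned above)
    data_source "example large cluster" 1 127.0.0.1:8654

    # Graphite/carbon export - if carbon-cache is slow to accept metrics,
    # this send path can stall gmetad's whole polling loop
    carbon_server "127.0.0.1"
    carbon_port 2003
    # milliseconds gmetad waits on carbon before giving up
    carbon_timeout 500
    graphite_prefix "ganglia"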
2015-08-19 22:59 GMT+03:00 Bostjan Skufca <bost...@a2o.si>:
> Does increasing gmetad's debug level (it runs in the foreground) yield
> anything useful?
>
>
> On 19 August 2015 at 21:15, Ludmil Stamboliyski <l.stamboliy...@ucdn.com> wrote:
> > Hi Bostjan, and thank you for your time.
> >
> > My setup is: gmond daemons on each monitored machine, configured in
> > unicast, 8 clusters, and one master node on which I have a gmond daemon
> > for each cluster running on a different port. On the master node I have
> > gmetad configured to send data to carbon-cache with rrds off (actually I
> > have a second gmetad for rrds, which is turned off while I am
> > investigating this issue). All hosts are present in the xml and their
> > "Reported" field changes on every run, so I think the gmond collector
> > works correctly.
> >
> > 2015-08-19 14:48 GMT+03:00 Bostjan Skufca <bost...@a2o.si>:
> >>
> >> Ludmil,
> >>
> >> do you have multiple headnodes? Do they receive data from all the
> >> nodes? If yes, did you verify it (telnet to each headnode on port 8649
> >> and count occurrences of the <HOST...> xml tag)?
> >>
> >> b.
> >>
> >>
> >> On 19 August 2015 at 12:01, Ludmil Stamboliyski <l.stamboliy...@ucdn.com> wrote:
> >> > Thank you Dave,
> >> >
> >> > I've done that, but to no avail. Then I did the following - ran a
> >> > separate gmetad for this cluster - again to no avail. Then I thought:
> >> > why not make gmetad poll data every second:
> >> >
> >> > data_source "example large cluster" 1 127.0.0.1:port
> >> >
> >> > And it seems to be almost working now - I get data every two minutes.
> >> > So clearly the bottleneck is between the gmond collector and gmetad -
> >> > any ideas how to improve things there?
> >> > Gmetad runs with rrds off and sends data to a carbon server. It also
> >> > uses memcached.
> >> >
> >> >
> >> > On Tue, Aug 18, 2015 at 6:04, David Chin <david.c...@drexel.edu> wrote:
> >> >
> >> > Hi Ludmil:
> >> >
> >> > I had a similar problem a couple of years ago on a cluster with about
> >> > 200 nodes.
> >> >
> >> > Currently, in a new place, I have about 120 nodes running Ganglia
> >> > 3.6.1. The difference in the new cluster was changing "globals {
> >> > send_metadata_interval }" from 0 to 120, which you already have. The
> >> > following is the globals section on the aggregator gmond:
> >> >
> >> > globals {
> >> >   daemonize = yes
> >> >   setuid = yes
> >> >   user = nobody
> >> >   debug_level = 0
> >> >   max_udp_msg_len = 1472
> >> >   mute = no
> >> >   deaf = no
> >> >   allow_extra_data = yes
> >> >   host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
> >> >   host_tmax = 20 /* secs */
> >> >   cleanup_threshold = 300 /* secs */
> >> >   gexec = no
> >> >   # If you are not using multicast, this value should be set to
> >> >   # something other than 0. Otherwise, if you restart the aggregator
> >> >   # gmond you will get empty graphs. 60 seconds is reasonable.
> >> >   send_metadata_interval = 60 /* secs */
> >> > }
> >> >
> >> > I also increased the UDP buffer size on the aggregator, to the value
> >> > set in the kernel ("sysctl net.core.rmem_max"):
> >> >
> >> > udp_recv_channel { ... buffer = 4194304 }
> >> >
> >> > On the gmetad, I use memcached. It only runs the default 4 threads.
> >> >
> >> > Good luck,
> >> > Dave
> >> >
> >> > On Tue, Aug 18, 2015 at 7:25 AM, Ludmil Stamboliyski
> >> > <l.stamboliy...@ucdn.com> wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> I am testing deploying ganglia to monitor our servers. I have several
> >> >> clusters - most of them are small, but I do have two large ones, with
> >> >> over 150 machines each to monitor. The issue is that I do not receive
> >> >> all the monitoring data from the machines in the large clusters -
> >> >> ganglia-web reports the clusters as down, and in graphite and in the
> >> >> rrds I see very few data points for machines in these large clusters
> >> >> - so by my calculations 2/3 of the data is lost. I am using gmond in
> >> >> unicast mode. Here are examples of my configs.
> >> >>
> >> >> Example config on a monitored server:
> >> >>
> >> >> globals {
> >> >>   daemonize = yes
> >> >>   setuid = yes
> >> >>   user = ganglia
> >> >>   debug_level = 0
> >> >>   max_udp_msg_len = 1472
> >> >>   mute = no
> >> >>   deaf = no
> >> >>   host_dmax = 86400 /* secs */
> >> >>   cleanup_threshold = 300 /* secs */
> >> >>   gexec = no
> >> >>   send_metadata_interval = 60
> >> >>   override_hostname = "<<!! HUMAN READABLE HOSTNAME !!>>"
> >> >> }
> >> >> cluster {
> >> >>   name = "Example large cluster"
> >> >>   owner = "unspecified"
> >> >>   latlong = "unspecified"
> >> >>   url = "unspecified"
> >> >> }
> >> >> udp_send_channel {
> >> >>   host = ip.addr.of.master
> >> >>   port = 8654
> >> >>   ttl = 1
> >> >> }
> >> >> udp_recv_channel {
> >> >>   port = 8649
> >> >> }
> >> >> tcp_accept_channel {
> >> >>   port = 8649
> >> >> }
> >> >> # Metric conf follows ...
> >> >>
> >> >> Example config of the gmond collector on the master node:
> >> >>
> >> >> globals {
> >> >>   daemonize = yes
> >> >>   setuid = yes
> >> >>   user = ganglia
> >> >>   debug_level = 0
> >> >>   max_udp_msg_len = 1472
> >> >>   mute = no
> >> >>   deaf = no
> >> >>   allow_extra_data = yes
> >> >>   host_dmax = 86400 /* secs */
> >> >>   cleanup_threshold = 300 /* secs */
> >> >>   gexec = no
> >> >>   send_metadata_interval = 120
> >> >> }
> >> >> cluster {
> >> >>   name = "Example large cluster"
> >> >>   owner = "unspecified"
> >> >>   latlong = "unspecified"
> >> >>   url = "unspecified"
> >> >> }
> >> >> udp_send_channel {
> >> >>   host = localhost
> >> >>   port = 8654
> >> >>   ttl = 1
> >> >> }
> >> >> udp_recv_channel {
> >> >>   port = 8654
> >> >> }
> >> >> tcp_accept_channel {
> >> >>   port = 8654
> >> >> }
> >> >>
> >> >> And here is an example of my gmetad.conf:
> >> >>
> >> >> data_source ...
> >> >> data_source "Example large cluster" localhost:8654
> >> >> data_source ...
> >> >>
> >> >> server_threads 16
> >> >>
> >> >> In the logs I see a lot of "Error 1 sending the modular data
> >> >> data_source" - I searched various threads but did not find anything
> >> >> helpful. I checked the network settings and tuned udp accordingly -
> >> >> the server does not drop packets; I also checked on the switch -
> >> >> there are no drops or losses. Load is rarely above 1.5, and this is a
> >> >> 16-core server with 128GB of ram. I ran the collector and gmetad in
> >> >> debug mode and they seemed fine.
> >> >>
> >> >> I am really lost, so I'll be grateful for any help.
> >> >
> >> >
> >> > --
> >> > David Chin, Ph.D.
> >> > david.c...@drexel.edu    Sr. Systems Administrator, URCF, Drexel U.
> >> > http://www.drexel.edu/research/urcf/
> >> > https://linuxfollies.blogspot.com/
> >> > +1.215.221.4747 (mobile)
> >> > https://github.com/prehensilecode
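For reference, the headnode check Bostjan suggests above fits in a one-liner - a sketch where "headnode" stands for each collector gmond's address and 8649 for its tcp_accept_channel port:

    # Dump the collector's XML and count the hosts it currently reports
    nc headnode 8649 | grep -c '<HOST '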
------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general