Hi Bostjan, and thank you for your time.

My setup is: gmond daemons on each monitored machine, configured for unicast; 8 clusters; and one master node on which I run a collector gmond for each cluster, each on a different port. On the master node, gmetad is configured to send data to carbon-cache with rrds off (I actually have a second gmetad for the RRDs, which is turned off while I investigate this issue). All hosts are present in the XML, and their "REPORTED" field changes on every run, so I think the gmond collectors work correctly.
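For reference, the check Bostjan suggested (connect to each collector and count `<HOST ...>` tags) can be scripted. This is a sketch using a canned sample in place of a live collector; the hostnames, values, and port are illustrative only:

```shell
# Sample of the XML a collector gmond serves on its tcp_accept_channel
# (made-up hosts; against a live collector you would instead run:
#   nc master-node 8654 | grep -c '<HOST ')
xml='<GANGLIA_XML>
<CLUSTER NAME="Example large cluster">
<HOST NAME="node01" REPORTED="1439990000" TN="12">
</HOST>
<HOST NAME="node02" REPORTED="1439990005" TN="7">
</HOST>
</CLUSTER>
</GANGLIA_XML>'

# Count reporting hosts; this should equal the cluster's node count:
printf '%s\n' "$xml" | grep -c '<HOST '
```

If the count matches the number of nodes in the cluster, the loss is downstream of the collector (collector-to-gmetad, or gmetad-to-carbon), not on the node-to-collector leg.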
2015-08-19 14:48 GMT+03:00 Bostjan Skufca <bost...@a2o.si>:
> Ludmil,
>
> do you have multiple headnodes? Do they receive data from all the
> nodes? If yes, did you verify it (telnet to each headnode on port 8649
> and count occurrences of the <HOST ...> XML tag)?
>
> b.
>
>
> On 19 August 2015 at 12:01, Ludmil Stamboliyski <l.stamboliy...@ucdn.com>
> wrote:
> > Thank you, Dave,
> >
> > I've done that, but to no avail. I then ran a separate gmetad for this
> > cluster - again to no avail. Then I thought: why not make gmetad pull
> > data every second:
> >
> > data_source "example large cluster" 1 127.0.0.1:port
> >
> > And it seems to be almost working now - I got data every two minutes. So
> > clearly the bottleneck is between the gmond collector and gmetad - any
> > ideas how to improve things there?
> > gmetad runs with rrds off and sends data to the carbon server. It also
> > has memcached.
> >
> >
> > On Tue, Aug 18, 2015 at 6:04, David Chin <david.c...@drexel.edu> wrote:
> >
> > Hi Ludmil:
> >
> > I had a similar problem a couple of years ago on a cluster with about 200
> > nodes.
> >
> > Currently, in a new place, I have about 120 nodes running Ganglia 3.6.1.
> > The difference in the new cluster was changing "globals {
> > send_metadata_interval }" from 0 to 120, which you already have. The
> > following is the globals section on the aggregator gmond:
> >
> > globals {
> >   daemonize = yes
> >   setuid = yes
> >   user = nobody
> >   debug_level = 0
> >   max_udp_msg_len = 1472
> >   mute = no
> >   deaf = no
> >   allow_extra_data = yes
> >   host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
> >   host_tmax = 20 /* secs */
> >   cleanup_threshold = 300 /* secs */
> >   gexec = no
> >   # If you are not using multicast this value should be set to something
> >   # other than 0. Otherwise if you restart the aggregator gmond you will
> >   # get empty graphs.
> >   # 60 seconds is reasonable.
> >   send_metadata_interval = 60 /* secs */
> > }
> >
> > I also increased the UDP buffer size on the aggregator, to the value set
> > in the kernel ("sysctl net.core.rmem_max"):
> >
> > udp_recv_channel { ... buffer = 4194304 }
> >
> > On the gmetad, I use memcached. It only runs the default 4 threads.
> >
> > Good luck,
> > Dave
> >
> > On Tue, Aug 18, 2015 at 7:25 AM, Ludmil Stamboliyski
> > <l.stamboliy...@ucdn.com> wrote:
> >>
> >> Hello,
> >>
> >> I am testing deploying Ganglia to monitor our servers. I have several
> >> clusters - most of them are small, but two are large, with over 150
> >> machines to monitor. The issue is that I do not receive all the
> >> monitoring data from the machines in the large clusters - ganglia-web
> >> reports the clusters as down, and in Graphite and in the RRDs I see very
> >> few data points for machines in these large clusters - so by my
> >> calculations 2/3 of the data is lost. I am using gmond in unicast mode.
> >> Here are examples of my configs.
> >>
> >> Example config on a monitored server:
> >>
> >> globals {
> >>   daemonize = yes
> >>   setuid = yes
> >>   user = ganglia
> >>   debug_level = 0
> >>   max_udp_msg_len = 1472
> >>   mute = no
> >>   deaf = no
> >>   host_dmax = 86400 /* secs */
> >>   cleanup_threshold = 300 /* secs */
> >>   gexec = no
> >>   send_metadata_interval = 60
> >>   override_hostname = "<<!! HUMAN READABLE HOSTNAME !!>>"
> >> }
> >> cluster {
> >>   name = "Example large cluster"
> >>   owner = "unspecified"
> >>   latlong = "unspecified"
> >>   url = "unspecified"
> >> }
> >> udp_send_channel {
> >>   host = ip.addr.of.master
> >>   port = 8654
> >>   ttl = 1
> >> }
> >> udp_recv_channel {
> >>   port = 8649
> >> }
> >> tcp_accept_channel {
> >>   port = 8649
> >> }
> >> # Metric conf follows ...
> >>
> >> Example config of the gmond collector on the master node:
> >>
> >> globals {
> >>   daemonize = yes
> >>   setuid = yes
> >>   user = ganglia
> >>   debug_level = 0
> >>   max_udp_msg_len = 1472
> >>   mute = no
> >>   deaf = no
> >>   allow_extra_data = yes
> >>   host_dmax = 86400 /* secs */
> >>   cleanup_threshold = 300 /* secs */
> >>   gexec = no
> >>   send_metadata_interval = 120
> >> }
> >> cluster {
> >>   name = "Example large cluster"
> >>   owner = "unspecified"
> >>   latlong = "unspecified"
> >>   url = "unspecified"
> >> }
> >> udp_send_channel {
> >>   host = localhost
> >>   port = 8654
> >>   ttl = 1
> >> }
> >> udp_recv_channel {
> >>   port = 8654
> >> }
> >> tcp_accept_channel {
> >>   port = 8654
> >> }
> >>
> >> And here is an example from my gmetad.conf:
> >>
> >> data_source ...
> >> data_source "Example large cluster" localhost:8654
> >> data_source ...
> >>
> >> server_threads 16
> >>
> >> In the logs I see a lot of "Error 1 sending the modular data data_source" -
> >> I searched various threads but did not find anything helpful.
> >> I checked the network settings and tuned UDP accordingly - the server
> >> does not drop packets; I also checked on the switch - there are no drops
> >> or losses. Load is rarely above 1.5, and this is a 16-core server with
> >> 128 GB of RAM. I ran the collector and gmetad in debug mode and they
> >> seemed fine.
> >>
> >> I am really lost, so I'll be grateful for any help.
> >
> > --
> > David Chin, Ph.D.
> > david.c...@drexel.edu    Sr. Systems Administrator, URCF, Drexel U.
> > http://www.drexel.edu/research/urcf/
> > https://linuxfollies.blogspot.com/
> > +1.215.221.4747 (mobile)
> > https://github.com/prehensilecode
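The kernel side of Dave's UDP buffer advice above can be checked from the shell. A minimal sketch, assuming a Linux master node; 4194304 is the value from Dave's example, not a recommendation:

```shell
# Ceiling for socket receive buffers (equivalently: sysctl net.core.rmem_max);
# the udp_recv_channel "buffer" setting cannot exceed this:
cat /proc/sys/net/core/rmem_max

# Raise it if it is below the buffer you want gmond to use, e.g.:
#   sysctl -w net.core.rmem_max=4194304

# Kernel UDP counters; a growing RcvbufErrors value means packets are
# being dropped because the socket buffer overflows between reads:
grep '^Udp:' /proc/net/snmp
```

On the gmetad side, the polling interval is the second (optional) field of `data_source`, in seconds, defaulting to 15 - that is the knob Ludmil's `data_source "example large cluster" 1 ...` workaround lowered.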
------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general