Does increasing gmetad's debug level (runs in foreground) yield anything useful?
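
For example (conf path is distro-dependent; assuming a stock install):

    gmetad --conf=/etc/ganglia/gmetad.conf --debug=2

Any debug level above zero keeps gmetad in the foreground, so you can
watch it poll each data_source.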



On 19 August 2015 at 21:15, Ludmil Stamboliyski <l.stamboliy...@ucdn.com> wrote:
> Hi Bostjan and thank you for your time,
>
> My setup is:
> gmond daemons on each monitored machine, configured for unicast, across 8
> clusters, and one master node running a collector gmond for each cluster,
> each on a different port. On the master node I have a gmetad daemon
> configured to send data to carbon-cache with RRDs off (I actually have a
> second gmetad for RRDs, which is turned off while I investigate this
> issue). All hosts are present in the XML and their "Reported" field
> changes on every run, so I think the collector gmond works correctly.
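>
> (For reference, the carbon hookup in gmetad.conf looks roughly like
> this; directive names are from the stock gmetad.conf and the values
> here are placeholders:
>
>     carbon_server "127.0.0.1"
>     carbon_port 2003
>     graphite_prefix "ganglia"
> )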
>
> 2015-08-19 14:48 GMT+03:00 Bostjan Skufca <bost...@a2o.si>:
>>
>> Ludmil,
>>
>> do you have multiple headnodes? Do they receive data from all the
>> nodes? If yes, did you verify it (telnet to each headnode on port 8649
>> and count occurrences of the <HOST ...> XML tag)?
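>>
>> For example (headnode name is a placeholder):
>>
>>     nc your-headnode 8649 | grep -c '<HOST '
>>
>> gmond dumps its current cluster state as XML on the TCP accept
>> channel, so the count should match the number of nodes you expect.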
>>
>> b.
>>
>>
>> On 19 August 2015 at 12:01, Ludmil Stamboliyski <l.stamboliy...@ucdn.com>
>> wrote:
>> > Thank you Dave,
>> >
>> > I've done that, but to no avail. Then I tried running a separate
>> > gmetad just for this cluster - again to no avail. Then I thought: why
>> > not make gmetad poll the data source every second:
>> > data_source "example large cluster" 1 127.0.0.1:port
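>> >
>> > (For reference, the data_source line format is:
>> >
>> >     data_source "cluster name" [polling interval] host1[:port] host2[:port] ...
>> >
>> > and the polling interval defaults to 15 seconds when omitted.)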
>> >
>> > And it almost works now - I get data every two minutes. So clearly
>> > the bottleneck is between the collector gmond and gmetad - any ideas
>> > how to improve things there?
>> > gmetad runs with RRDs off and sends data to the carbon server. It
>> > also uses memcached.
>> >
>> >
>> > On Tue, Aug 18, 2015 at 6:04, David Chin <david.c...@drexel.edu> wrote:
>> >
>> > Hi Ludmil:
>> >
>> > I had a similar problem a couple of years ago on a cluster with about
>> > 200
>> > nodes.
>> >
>> > Currently, in a new place, I have about 120 nodes running Ganglia
>> > 3.6.1.
>> > The difference in the new cluster was changing "globals {
>> > send_metadata_interval }" from 0 to 120, which you already have. The
>> > following is the globals on the aggregator gmond:
>> >
>> > globals {
>> >   daemonize = yes
>> >   setuid = yes
>> >   user = nobody
>> >   debug_level = 0
>> >   max_udp_msg_len = 1472
>> >   mute = no
>> >   deaf = no
>> >   allow_extra_data = yes
>> >   host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
>> >   host_tmax = 20 /* secs */
>> >   cleanup_threshold = 300 /* secs */
>> >   gexec = no
>> >   # If you are not using multicast this value should be set to something
>> >   # other than 0. Otherwise if you restart the aggregator gmond you will
>> >   # get empty graphs. 60 seconds is reasonable.
>> >   send_metadata_interval = 60 /* secs */
>> > }
>> >
>> > I also increased the UDP buffer size on the aggregator, to the value
>> > set in the kernel's "net.core.rmem_max" sysctl:
>> >
>> >      udp_recv_channel { ... buffer = 4194304 }
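>> >
>> > (To check or raise the kernel limit on Linux, something like the
>> > following works; 4194304 is just this setup's value:
>> >
>> >      sysctl net.core.rmem_max
>> >      sysctl -w net.core.rmem_max=4194304
>> >
>> > A buffer value larger than rmem_max is silently capped by the kernel.)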
>> >
>> > On the gmetad, I use memcached. gmetad runs only the default 4
>> > server threads.
>> >
>> > Good luck,
>> >     Dave
>> >
>> > On Tue, Aug 18, 2015 at 7:25 AM, Ludmil Stamboliyski
>> > <l.stamboliy...@ucdn.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I am testing a Ganglia deployment to monitor our servers. I have
>> >> several clusters - most of them are small, but I do have two large
>> >> ones, with over 150 machines to monitor. The issue is that I do not
>> >> receive all the monitoring data from the machines in the large
>> >> clusters - ganglia-web reports the clusters as down, and in Graphite
>> >> and in the RRDs I see very few data points for machines in these
>> >> large clusters - by my calculations 2/3 of the data is lost. I am
>> >> using gmond in unicast mode. Here are examples of my configs:
>> >>
>> >>
>> >> Example config on a monitored server:
>> >>
>> >> globals {
>> >>   daemonize = yes
>> >>   setuid = yes
>> >>   user = ganglia
>> >>   debug_level = 0
>> >>   max_udp_msg_len = 1472
>> >>   mute = no
>> >>   deaf = no
>> >>   host_dmax = 86400 /*secs */
>> >>   cleanup_threshold = 300 /*secs */
>> >>   gexec = no
>> >>   send_metadata_interval = 60
>> >>   override_hostname = "<<!! HUMAN READABLE HOSTNAME !!>>"
>> >> }
>> >> cluster {
>> >>   name = "Example large cluster"
>> >>   owner = "unspecified"
>> >>   latlong = "unspecified"
>> >>   url = "unspecified"
>> >> }
>> >> udp_send_channel {
>> >>   host = ip.addr.of.master
>> >>   port = 8654
>> >>   ttl = 1
>> >> }
>> >> udp_recv_channel {
>> >>   port = 8649
>> >> }
>> >> tcp_accept_channel {
>> >>   port = 8649
>> >> }
>> >> # Metric conf follows ...
>> >>
>> >> Example config of the collector gmond on the master node:
>> >>
>> >> globals {
>> >>   daemonize = yes
>> >>   setuid = yes
>> >>   user = ganglia
>> >>   debug_level = 0
>> >>   max_udp_msg_len = 1472
>> >>   mute = no
>> >>   deaf = no
>> >>   allow_extra_data = yes
>> >>   host_dmax = 86400 /*secs */
>> >>   cleanup_threshold = 300 /*secs */
>> >>   gexec = no
>> >>   send_metadata_interval = 120
>> >> }
>> >> cluster {
>> >>   name = "Example large cluster"
>> >>   owner = "unspecified"
>> >>   latlong = "unspecified"
>> >>   url = "unspecified"
>> >> }
>> >> udp_send_channel {
>> >>   host = localhost
>> >>   port = 8654
>> >>   ttl = 1
>> >> }
>> >> udp_recv_channel {
>> >>   port = 8654
>> >> }
>> >> tcp_accept_channel {
>> >>   port = 8654
>> >> }
>> >>
>> >>
>> >> And here is an example of my gmetad.conf:
>> >>
>> >> data_source ...
>> >> data_source "Example large cluster" localhost:8654
>> >> data_source ...
>> >>
>> >> server_threads 16
>> >>
>> >>
>> >> In the logs I see a lot of "Error 1 sending the modular data
>> >> data_source" - I searched various threads but did not find anything
>> >> helpful.
>> >> I checked the network settings and tuned UDP accordingly - the
>> >> server does not drop packets; I also checked on the switch - there
>> >> are no drops or losses. Load is rarely above 1.5 and this is a
>> >> 16-core server with 128 GB of RAM. I ran the collector and gmetad in
>> >> debug mode and they seemed fine.
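>> >>
>> >> (For example, UDP drop counters on Linux can be checked with:
>> >>
>> >>      netstat -su
>> >>
>> >> Growing "packet receive errors" or "receive buffer errors" under
>> >> load would point at a too-small receive buffer.)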
>> >>
>> >> I am really lost, so I'll be grateful for any help.
>> >>
>> >
>> >
>> >
>> > --
>> > David Chin, Ph.D.
>> > david.c...@drexel.edu    Sr. Systems Administrator, URCF, Drexel U.
>> > http://www.drexel.edu/research/urcf/
>> > https://linuxfollies.blogspot.com/
>> > +1.215.221.4747 (mobile)
>> > https://github.com/prehensilecode
>> >
>> >
>> >
>
>

------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
