Re: [Ganglia-general] missing data on large clusters

2015-08-19 Thread Bostjan Skufca
Ludmil,

Do you have multiple headnodes? Do they receive data from all the
nodes? If yes, did you verify it (telnet to each headnode on port 8649
and count occurrences of the HOST XML tag)?
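
For example, something like this does the count (netcat works as well as
telnet here, since gmond dumps its XML and closes the connection;
headnode.example.com is just a placeholder):

  nc headnode.example.com 8649 | grep -c '<HOST '

and to see which hosts are actually present:

  nc headnode.example.com 8649 | grep -o '<HOST NAME="[^"]*"'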

b.


On 19 August 2015 at 12:01, Ludmil Stamboliyski l.stamboliy...@ucdn.com wrote:
 Thank you Dave,

 I've done that, but to no avail. Then I did the following: I ran a separate
 gmetad for this cluster - again to no avail. Then I thought, why not make
 gmetad poll data every second:
 data_source "example large cluster" 1 127.0.0.1:port

 And it seems to be almost working now - I got data every two minutes. So
 clearly the bottleneck is between the gmond collector and gmetad - any ideas
 how to improve things there?
 Gmetad runs with RRDs off and sends data to the carbon server. It also uses
 memcached.

Re: [Ganglia-general] missing data on large clusters

2015-08-19 Thread Ludmil Stamboliyski
Hi Bostjan, and thank you for your time.

My setup is:
a gmond daemon on each monitored machine, configured for unicast; 8
clusters; and one master node running a separate gmond daemon for each
cluster, each listening on a different port. On the master node I also have
a gmetad daemon configured to send data to carbon-cache, with RRDs off
(I actually have a second gmetad for RRDs, which is turned off while I am
investigating this issue). All hosts are present in the XML and their
REPORTED field is updated on every run, so I think the gmond collectors
work correctly.
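
Roughly, the data_source section of that gmetad looks like the sketch
below (cluster names and ports here are illustrative, not the real ones;
the optional number after the cluster name is the polling interval in
seconds):

  data_source "Example large cluster" 1 localhost:8654
  data_source "Another large cluster" 1 localhost:8655
  data_source "Small cluster A" 15 localhost:8656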

2015-08-19 14:48 GMT+03:00 Bostjan Skufca bost...@a2o.si:

 Ludmil,

 Do you have multiple headnodes? Do they receive data from all the
 nodes? If yes, did you verify it (telnet to each headnode on port 8649
 and count occurrences of the HOST XML tag)?

 b.

Re: [Ganglia-general] missing data on large clusters

2015-08-19 Thread Bostjan Skufca
Does increasing gmetad's debug level (it then runs in the foreground) yield anything useful?
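
Something along these lines should show what it is doing while it polls
(the config path is the usual default - adjust to your installation):

  gmetad --conf=/etc/ganglia/gmetad.conf --debug=2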

Re: [Ganglia-general] missing data on large clusters

2015-08-19 Thread Ludmil Stamboliyski
OK guys, thanks to your help we can count this as resolved. For anyone who
wants to use Graphite and carbon-cache, here is a piece of advice: run a
separate gmetad daemon dedicated only to feeding carbon. The key is to set
up carbon and gmetad to communicate over UDP - that gave me a threefold
increase in received metrics. I am still digging into what is wrong with the
Ubuntu TCP stack, but in any case it works fine over UDP.
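
For reference, the relevant part of the carbon-only gmetad.conf looks
roughly like this (directive names quoted from memory, so check the sample
gmetad.conf shipped with your version; the host, port and prefix are
placeholders):

  carbon_server "127.0.0.1"
  carbon_port 2003
  carbon_protocol udp
  graphite_prefix "ganglia"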

2015-08-19 23:34 GMT+03:00 Ludmil Stamboliyski l.stamboliy...@ucdn.com:

 So... I found the culprit - it turns out that carbon-cache was slowing down
 the whole gmetad daemon... Now, with a poll interval of 1 and RRDs on, it
 finally became stable and began to load the machine as expected. The next
 thing to answer is why carbon is so slow.
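
If anyone else hits the same thing: carbon-cache has its own throttling
settings in carbon.conf that are worth checking - these are standard carbon
options, and the values below are just the usual defaults, not taken from
my setup:

  [cache]
  MAX_CACHE_SIZE = inf
  MAX_UPDATES_PER_SECOND = 500
  MAX_CREATES_PER_MINUTE = 50
  USE_FLOW_CONTROL = True

With USE_FLOW_CONTROL enabled, carbon stops accepting data once its cache
fills up, which stalls a TCP feed from gmetad, while a UDP feed simply
drops the excess.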

Re: [Ganglia-general] missing data on large clusters

2015-08-19 Thread Ludmil Stamboliyski

Thank you Dave,

I've done that, but to no avail. Then I did the following: I ran a
separate gmetad for this cluster - again to no avail. Then I thought, why
not make gmetad poll data every second:

data_source "example large cluster" 1 127.0.0.1:port

And it seems to be almost working now - I got data every two minutes. So
clearly the bottleneck is between the gmond collector and gmetad - any
ideas how to improve things there?
Gmetad runs with RRDs off and sends data to the carbon server. It also
uses memcached.


[Ganglia-general] missing data on large clusters

2015-08-18 Thread Ludmil Stamboliyski

Hello,

I am testing a Ganglia deployment to monitor our servers. I have several
clusters - most of them are small ones, but I do have two large ones,
with over 150 machines to monitor. The issue is that I do not receive all
the monitoring data from the machines in the large clusters: ganglia-web
reports the clusters as down, and in Graphite and in the RRDs I see very
few data points for the machines in these large clusters - so by my
calculations 2/3 of the data is lost. I am using gmond in unicast mode.
Here are examples of my configs:



Example config on a monitored server:

globals {
 daemonize = yes
 setuid = yes
 user = ganglia
 debug_level = 0
 max_udp_msg_len = 1472
 mute = no
 deaf = no
 host_dmax = 86400 /*secs */
 cleanup_threshold = 300 /*secs */
 gexec = no
 send_metadata_interval = 60
 override_hostname = !! HUMAN READABLE HOSTNAME !!
}
cluster {
 name = "Example large cluster"
 owner = "unspecified"
 latlong = "unspecified"
 url = "unspecified"
}
udp_send_channel {
 host = ip.addr.of.master
 port = 8654
 ttl = 1
}
udp_recv_channel {
 port = 8649
}
tcp_accept_channel {
 port = 8649
}
# Metric conf follows ...

Example config of the gmond collector on the master node:

globals {
 daemonize = yes
 setuid = yes
 user = ganglia
 debug_level = 0
 max_udp_msg_len = 1472
 mute = no
 deaf = no
 allow_extra_data = yes
 host_dmax = 86400 /*secs */
 cleanup_threshold = 300 /*secs */
 gexec = no
 send_metadata_interval = 120
}
cluster {
 name = "Example large cluster"
 owner = "unspecified"
 latlong = "unspecified"
 url = "unspecified"
}
udp_send_channel {
 host = localhost
 port = 8654
 ttl = 1
}
udp_recv_channel {
 port = 8654
}
tcp_accept_channel {
 port = 8654
}


And here is an example of my gmetad.conf:

data_source ...
data_source "Example large cluster" localhost:8654
data_source ...

server_threads 16


In the logs I see a lot of "Error 1 sending the modular data" messages for
the data_source - I searched various threads but did not find anything
helpful.
I checked the network settings and tuned UDP accordingly - the server does
not drop packets; I also checked the switch and there are no drops or
losses there. Load rarely goes above 1.5, and this is a 16-core server with
128 GB of RAM. I ran the collector and gmetad in debug mode and everything
seemed fine.


I am really lost, so I'll be grateful for any help.



Re: [Ganglia-general] missing data on large clusters

2015-08-18 Thread David Chin
Hi Ludmil:

I had a similar problem a couple of years ago on a cluster with about 200
nodes.

Currently, at a new place, I have about 120 nodes running Ganglia 3.6.1.
The difference in the new cluster was changing globals {
send_metadata_interval } from 0 to 120, which you already have. The
following is the globals on the aggregator gmond:

globals {
  daemonize = yes
  setuid = yes
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  allow_extra_data = yes
  host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1
day */
  host_tmax = 20 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  # If you are not using multicast this value should be set to something
other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60
seconds is reasonable
  send_metadata_interval = 60 /*secs */
}

I also increased the UDP buffer size on the aggregator to the value set in
the kernel's sysctl net.core.rmem_max:

 udp_recv_channel { ... buffer = 4194304 }
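
To spell that out: the kernel limit and the channel buffer are set
separately. The 4 MB figure is just the value I happen to use, and port
8654 below is taken from your aggregator config:

 sysctl -w net.core.rmem_max=4194304

 udp_recv_channel {
   port = 8654
   buffer = 4194304
 }

Add net.core.rmem_max to /etc/sysctl.conf to make the change survive a
reboot.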

On the gmetad, I use memcached. It only runs the default 4 threads.

Good luck,
Dave

-- 
David Chin, Ph.D.
david.c...@drexel.edu
Sr. Systems Administrator, URCF, Drexel U.
http://www.drexel.edu/research/urcf/
https://linuxfollies.blogspot.com/
+1.215.221.4747 (mobile)
https://github.com/prehensilecode
--
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general