[Ganglia-general] gmond host no longer collecting data from other nodes

Christopher D Cprek Thu, 14 Jul 2011 09:05:56 -0700

Hello all,  

I'm hoping someone can point me in the right direction. After an
upgrade to Red Hat EL 5.6 earlier this week, my gmond collector service
is only showing the localhost and none of the other nodes, even though
their gmond multicast traffic is arriving. I can see the multicast
traffic getting to this server, but 'gstat -a' lists nothing but itself.


It's quite bizarre because my ganglia nodes didn't disappear until
about 24 hours after the upgrade had been completed. I've verified that
SELinux is disabled and the issue persists. iptables has been disabled
(just in case) and it still persists. The /etc/ganglia/gmond.conf file
was copied back directly from backup *and* from known working nodes,
with no change. Note: the non-upgraded nodes correctly list all other
nodes with 'gstat -a'.

I'm really at a loss here. Any pointers on where to look are
appreciated. I'm hoping it's something simple I'm stupidly overlooking.
Relevant info below. Thanks in advance! 

--gmond multicast traffic is making it to this gmond collector 

# tcpdump -i any ip multicast 
tcpdump: WARNING: Promiscuous mode not supported on the "any" device 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode 
listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes 
11:31:16.997129 IP mgt1.36677 > 239.2.11.71.8649: UDP, length 44 
11:31:22.913169 IP node011.45309 > 239.2.11.71.8649: UDP, length 52 
11:31:22.913177 IP node011.45309 > 239.2.11.71.8649: UDP, length 52 
11:31:22.913279 IP node006.50998 > 239.2.11.71.8649: UDP, length 52 
11:31:22.913285 IP node006.50998 > 239.2.11.71.8649: UDP, length 52 
11:31:22.913503 IP node004.48911 > 239.2.11.71.8649: UDP, length 52 
11:31:22.913511 IP node004.48911 > 239.2.11.71.8649: UDP, length 52 
11:31:22.916303 IP node005.56330 > 239.2.11.71.8649: UDP, length 48 
11:31:22.918336 IP node006.50998 > 239.2.11.71.8649: UDP, length 48 
*snip* 
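
Since tcpdump only shows that the packets reach the interface, one way to check whether they are also delivered to an ordinary user-space socket is a small listener like the sketch below. It is only a rough diagnostic, not part of Ganglia: it assumes a Python 3 interpreter is available on the host, the function names are made up for illustration, and the group and port are simply the values from the gmond.conf in this post. It may be simplest to run it with gmond stopped, since both would bind port 8649.

```python
import socket
import time

GROUP = "239.2.11.71"  # multicast group taken from the gmond.conf in this post
PORT = 8649            # port taken from the gmond.conf in this post


def membership_request(group, iface_ip="0.0.0.0"):
    """Pack an ip_mreq: group address followed by the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def open_listener(group=GROUP, port=PORT, iface_ip="0.0.0.0"):
    """Return a UDP socket that has joined `group` on the given local address.

    iface_ip is the IPv4 address of the interface to join on (hypothetical
    parameter); 0.0.0.0 lets the kernel pick the interface.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, iface_ip))
    return sock


def count_senders(sock, seconds=30.0):
    """Collect datagrams for `seconds` and return {sender_ip: packet_count}."""
    seen = {}
    deadline = time.monotonic() + seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sock.settimeout(remaining)
        try:
            _, (addr, _port) = sock.recvfrom(2048)
        except socket.timeout:
            break
        seen[addr] = seen.get(addr, 0) + 1
    return seen
```

Calling count_senders(open_listener(iface_ip=...)) with the address of eth0 and printing the result would show which senders, if any, get past the kernel to a socket. An empty result while tcpdump still shows traffic would suggest the packets are being dropped between the interface and the socket layer rather than by gmond itself.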

--But it's not seeing any other nodes besides itself 

# gstat -a 
CLUSTER INFORMATION 
       Name: Cluster 
      Hosts: 1 
Gexec Hosts: 0 
 Dead Hosts: 0 
  Localtime: Thu Jul 14 11:32:29 2011 

CLUSTER HOSTS 
Hostname                     LOAD                       CPU              Gexec
 CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle, Wio]

mgn2
    8 (    0/  618) [  0.15,  0.36,  0.48] [   0.6,   0.0,   0.2,  98.8,   0.4] OFF
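
The same host list can be cross-checked without gstat by reading the XML that gmond serves on its TCP port. The sketch below is a minimal illustration, assuming Python 3 and assuming the dump uses CLUSTER and HOST elements with a NAME attribute; the function names are hypothetical, and localhost:8649 is just the tcp_accept_channel port used in this setup.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read everything gmond writes on its TCP channel until it closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_by_cluster(xml_bytes):
    """Map each CLUSTER NAME to the sorted HOST NAME values beneath it."""
    root = ET.fromstring(xml_bytes)
    return {
        cluster.get("NAME"): sorted(h.get("NAME") for h in cluster.iter("HOST"))
        for cluster in root.iter("CLUSTER")
    }
```

Running hosts_by_cluster(fetch_gmond_xml()) on the upgraded collector and on a non-upgraded node, then comparing the two results, would confirm whether the missing hosts are absent from gmond's own state or only from what gstat displays.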

--gmetad.conf 

# cat gmetad.conf |grep -v "#" 

data_source "Cluster" 10 localhost 

all_trusted on 

--gmond.conf after I've added in receive channels for *every*
interface. This wasn't in the original working config, but I was trying
anything at this point. 

# cat gmond.conf  
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */ 
globals { 
  daemonize = yes 
  setuid = yes 
  user = nobody 
  debug_level = 2 
  max_udp_msg_len = 1472 
  mute = no 
  deaf = no 
  allow_extra_data = yes 
  host_dmax = 0 /*secs */ 
  cleanup_threshold = 300 /*secs */ 
  gexec = no 
  send_metadata_interval = 0 /*secs */ 
} 

/* 
 * The cluster attributes specified will be used as part of the <CLUSTER> 
 * tag that will wrap all hosts collected by this instance. 
 */ 
cluster { 
  name = "Cluster" 
  owner = "n/a" 
  latlong = "n/a" 
  url = "n/a" 
} 

/* The host section describes attributes of the host, like the location */ 
host { 
  location = "n/a" 
} 

/* Feel free to specify as many udp_send_channels as you like.  Gmond 
   used to only support having a single channel */ 
udp_send_channel { 
  mcast_join = 239.2.11.71 
  port = 8649 
  ttl = 1 
  mcast_if=eth0 
} 

/* You can specify as many udp_recv_channels as you like as well. */ 
udp_recv_channel { 
  mcast_join = 239.2.11.71  
  port = 8649 
  bind = 239.2.11.71 
  mcast_if=eth0  
} 

udp_recv_channel { 
  mcast_join = 239.2.11.71  
  port = 8649 
  bind = 239.2.11.71 
  mcast_if=eth0:1  
} 

udp_recv_channel { 
  mcast_join = 239.2.11.71  
  port = 8649 
  bind = 239.2.11.71 
  mcast_if=eth1  
} 

udp_recv_channel { 
  mcast_join = 239.2.11.71  
  port = 8649 
  bind = 239.2.11.71 
  mcast_if=eth2  
} 

udp_recv_channel { 
  mcast_join = 239.2.11.71  
  port = 8649 
  bind = 239.2.11.71 
  mcast_if=eth3  
} 
*snip modules* 

--debug start-up output 
# service gmond restart 
Shutting down GANGLIA gmond:                               [  OK  ] 
Starting GANGLIA gmond: loaded module: core_metrics 
loaded module: cpu_module 
loaded module: disk_module 
loaded module: load_module 
loaded module: mem_module 
loaded module: net_module 
loaded module: proc_module 
loaded module: sys_module 
udp_recv_channel mcast_join=239.2.11.71 mcast_if=eth0 port=8649 bind=239.2.11.71 
udp_recv_channel mcast_join=239.2.11.71 mcast_if=eth0:1 port=8649 bind=239.2.11.71 
udp_recv_channel mcast_join=239.2.11.71 mcast_if=eth1 port=8649 bind=239.2.11.71 
udp_recv_channel mcast_join=239.2.11.71 mcast_if=eth2 port=8649 bind=239.2.11.71 
udp_recv_channel mcast_join=239.2.11.71 mcast_if=eth3 port=8649 bind=239.2.11.71 
tcp_accept_channel bind=NULL port=8649 
udp_send_channel mcast_join=239.2.11.71 mcast_if=eth0 host=NULL port=8649 

metric 'cpu_user' being collected now 
metric 'cpu_user' has value_threshold 1.000000


_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
