Thanks for the info, Steve... I think I've found part of the problem. I
pointed gmetad at the gmond on node02 instead of the local gmond. Here's
what I get:
(NOTE: prior to running this, node02 is running gmond at debug level 2,
and I can see the data that gmond is writing to stdout on node02.)
[EMAIL PROTECTED] etc]# /usr/sbin/gmetad
Setting debug level to 2
Datasource = [Node02]
Trying to connect to 192.168.5.2:8649 for [Node02]
Data inserted for [Node02] into sources hash
Going to run as user nobody
Sources are ...
Source: [Node02] has 1 sources
192.168.5.2
listening on port 8651
3076 is monitoring [Node02] data source
192.168.5.2
save_to_rrd() XML_ParseBuffer() error at line 1:
no element found
data_thread() couldn't parse the XML and data to RRD for [Node02]
[Node02] is a dead source
save_to_rrd() XML_ParseBuffer() error at line 1:
no element found
Is this saying that no data is coming from node02, or that gmetad cannot
save to the round-robin database (RRD) for some reason?
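(If I'm reading the expat error right, "no element found" at line 1 means
the parser was handed an empty document - i.e. the TCP connection to
192.168.5.2:8649 returned no XML at all, rather than the RRD write failing.
One way to check by hand, assuming telnet or netcat is on the box:

telnet 192.168.5.2 8649

A healthy gmond should immediately dump a screenful of <GANGLIA_XML> output
and close the connection; an empty reply or "connection refused" there
would explain the error above.)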
Additionally, I checked the firewall config on the internal compute nodes.
I'm no expert on iptables, but it looks like there are NO rules:
[EMAIL PROTECTED] etc]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
[EMAIL PROTECTED] etc]# ipchains -L
ipchains: Incompatible with this kernel
[EMAIL PROTECTED] etc]#
Note that node04 had the exact same output for iptables -L. Additionally, I
set hosts.allow to ALL:ALL on node02 and node04, then repeated the test
with node02 running in deaf debug mode and node04 running in mute debug
mode. Node04 never received any data; it just kept calling the cleanup
thread.
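Next I guess I'll try tcpdump on the mute node to see whether the multicast
packets even reach its wire. Assuming the default gmond channel of
239.2.11.71 (we only set mcast_if, not mcast_channel, in gmond.conf):

tcpdump -i eth0 host 239.2.11.71

Since tcpdump puts eth0 into promiscuous mode, it should see the multicast
frames even if the kernel never joined the group - so packets in tcpdump
but nothing in gmond would point at the kernel, while no packets at all
would point at the sender or the switch.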
I'm stumped... I guess the parse buffer error above means that there's no
buffer to parse?
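One more thing worth checking on the receiving side is whether the kernel
actually joined the multicast group. While gmond is running on node04,
something like

netstat -g

(or cat /proc/net/igmp) should list 239.2.11.71 against eth0, again
assuming the default channel. If the group never shows up, maybe the new
kernel was built without CONFIG_IP_MULTICAST.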
Thanks again,
-Phil
At 12:46 PM 12/19/2002, you wrote:
Phil Forrest wrote:
Hello All,
Once upon a time, I had a happy ganglia monitor that was giving me
valuable data on all nodes of my 48-node cluster. Then I got a request
from a user to upgrade the kernel. After I upgraded the kernels across
the cluster, my ganglia could only see the data from the gmond running on
the head node (which also had gmetad and httpd running).
The cluster is running Red Hat 7.3 with kernel 2.4.9-34smp #1 SMP Sat Jun
1 05:54:57 EDT 2002 i686 unknown
My cluster has 46 compute nodes with one interface (eth0), and two head
nodes with two interfaces (eth0 and eth1): one for the private LAN and one
for the campus network. The head node that runs gmetad has
"mcast_if eth1" set in its gmond.conf file. Here's the /sbin/ifconfig
slice for eth1 on the head node:
eth1 Link encap:Ethernet HWaddr 00:40:F4:2A:6E:26
inet addr:192.168.5.200 Bcast:192.168.5.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:176581970 errors:0 dropped:0 overruns:0 frame:0
TX packets:160905314 errors:0 dropped:0 overruns:0 carrier:0
collisions:0
RX bytes:1187468116 (1132.4 Mb) TX bytes:2350492219 (2241.6 Mb)
Can I trust the output of /sbin/ifconfig? Meaning: if /sbin/ifconfig says
MULTICAST is enabled, is that the real truth, or can the kernel still
suppress multicast transmissions?
The kernel's firewalling configuration can still filter out multicast
traffic. Check your firewall config (man iptables :) ). If your config
is very restrictive, at least poke a li'l hole for the multicast IP/port combo.
IIRC, the default iptables behavior changed a few point releases back in
Red Hat - it's now on by default. Apparently that's to keep everyone who
installs it on a desktop connected to the net via cable modem from
getting owned...
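With the stock gmond settings (mcast_channel 239.2.11.71, mcast_port 8649 -
adjust if gmond.conf says otherwise), that hole would look something like:

iptables -A INPUT -p udp -d 239.2.11.71 --dport 8649 -j ACCEPT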
Also, gmetad cares not one whit about /etc/gmond.conf. I just did a
once-over on the code to make absolutely sure; there's no mention of it.
It's /etc/gmetad.conf that you should concern yourself with on the head
units if you're having display problems. Unless they're also supposed to
be part of the cluster, in which case you would configure the gmonds
separately.
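For reference, the only part of /etc/gmetad.conf that really matters here
is the data_source line. Going by the debug output above, yours presumably
looks something like:

data_source "Node02" 192.168.5.2:8649

One data_source line per cluster; gmetad polls each listed address over TCP.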
Remember to open TCP port 8649 in the firewall on hosts running the
monitoring core (gmond), and TCP port 8651 on hosts running gmetad.
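On a box whose policy isn't ACCEPT, rules along these lines would do it:

iptables -A INPUT -p tcp --dport 8649 -j ACCEPT   # on the gmond hosts
iptables -A INPUT -p tcp --dport 8651 -j ACCEPT   # on the gmetad host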
The metadaemon should be determining the path to establish its connections
via the good ol' fashioned kernel routing table, just like anything else.
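If there's any doubt, a quick

route -n

on the head node will show which interface covers the 192.168.5.0/24 side
(netstat -rn shows the same thing).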
As a test, I've been running gmond on one node in deaf debug mode, and on
another node in mute debug mode. The deaf one is pumping out data
successfully and the mute one is not seeing anything. Since this is
compute node to compute node, there can only be one interface (eth0).
There has to be something in the kernel config that is screwing this up.
That sounds like it's a firewall config issue or a router/switch config
issue to me...
I'm wondering, with all the kernel upgrades going on out there, whether
someone else has had similar issues? Thanks in advance for any info!
7.2 / 2.4.19smp on most of our nodes here, no reported problems with the
monitoring core on any of them.
Happy Holidays To All,
-Phil Forrest
Yeah, happy Life Day, kids. ;)
Hope this info proves useful...
Phil Forrest
334-844-6910
Auburn University Dept. of Physics
Network & Scientific Computing
207 Leach Science Center