When in doubt, use telnet.
See what telnetting to node02 from prism (which I assume is one of the head
systems) on port 8649 gets you. You should at the very least get a Ganglia
DTD and XML for one node. If you don't, something is really wrong
(congratulations, you found a weird bug). If you see data from only one
node, try the same thing on the other nodes - poll each of them on that
port and see whether you get *their* node information back.
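For example (assuming node02 resolves from prism and gmond is on its
default port; nc works too if telnet isn't handy):

```shell
# Poll gmond's XML port on node02; expect the Ganglia DTD followed by
# <GANGLIA_XML ...> containing a <HOST ...> block per node it knows about.
telnet node02 8649

# Or, non-interactively, dump the XML and count the hosts reported:
nc node02 8649 | grep -c '<HOST '
```

Repeat that against each node and compare - whoever reports the fewest
hosts is the one not hearing the multicast traffic.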
If, for example, node 1 only has data about itself, whereas nodes 2-4 have
information on all nodes except for node 1 ... well, you can probably
figure out where the disconnect is there. ;)
If all the nodes only have data about themselves, that suggests either a
misconfiguration (unlikely, since it was all working with the same config
before you upgraded ... *cough* right? :) ), a kernel issue (although
nothing in the kernel config or build flags springs to mind), or a
network issue. Try pinging the multicast IP (you should get a
response from every node that's listening on that IP) and your old friends,
tcpdump and a really long cat5 cable. As you've said, the XDR packets are
being *sent* by at least one of your nodes - you're watching them go out.
So, you run tcpdump on that host and make sure they actually make it out at
the kernel level. Then you go one hop down on the network and run tcpdump
again. You should still see them. Keep going one spigot at a time until
you've traversed the path from one node to another (this may be a short
trip if they're on the same switch of course :) ).
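Concretely, something like this (239.2.11.71:8649 are gmond's
compiled-in multicast defaults; substitute whatever your gmond.conf
actually uses):

```shell
# Every gmond subscribed to the multicast channel should answer this:
ping -c 3 239.2.11.71

# On a sending node: confirm the XDR packets actually leave eth0 ...
tcpdump -i eth0 -n udp port 8649

# ... and on a node that should be hearing them: confirm they arrive.
tcpdump -i eth0 -n dst host 239.2.11.71 and udp port 8649
```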
Also, remember that gmonds have a "trusted host" config directive and that
they will accept and immediately close connections from hosts *not* in that
list. To gmetad this will look like a successful connection with no data.
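From memory, the directive in the flat /etc/gmond.conf looks like the
following (the head-node IP here is taken from later in this thread;
adjust to your own setup):

```
# Hosts allowed to connect to gmond's XML port (8649). The host running
# gmetad must be listed, or its connections get accepted and immediately
# closed - which gmetad sees as "no element found".
trusted_hosts 127.0.0.1 192.168.5.200
```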
Phil Forrest wrote:
Thanks for the info Steve...I think I've found some part of the problem.
I pointed gmetad to the gmond on node02 instead of the local gmond.
Here's what I get:
(NOTE: prior to running this, node02 is running gmond in debug mode
level 2 and I am seeing data on node02 that gmond is putting to stdout)
[EMAIL PROTECTED] etc]# /usr/sbin/gmetad
Setting debug level to 2
Datasource = [Node02]
Trying to connect to 192.168.5.2:8649 for [Node02]
Data inserted for [Node02] into sources hash
Going to run as user nobody
Sources are ...
Source: [Node02] has 1 sources
192.168.5.2
listening on port 8651
3076 is monitoring [Node02] data source
192.168.5.2
save_to_rrd() XML_ParseBuffer() error at line 1:
no element found
data_thread() couldn't parse the XML and data to RRD for [Node02]
[Node02] is a dead source
save_to_rrd() XML_ParseBuffer() error at line 1:
no element found
Is this saying that there is no data coming from node02, or that it
cannot save to the round-robin database for some reason?
Additionally, I checked the firewall config on the internal compute
nodes. I'm no expert on iptables, but it looks like there are NO rules:
[EMAIL PROTECTED] etc]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
[EMAIL PROTECTED] etc]# ipchains -L
ipchains: Incompatible with this kernel
[EMAIL PROTECTED] etc]#
Note that node04 had the exact same output for iptables -L.
Additionally, I set hosts.allow to ALL:ALL on node02 and node04 then
repeated the test where node02 was running in deaf debug mode and node04
was running in mute debug mode. Node04 never received any data. It just
kept calling the cleanup thread.
I'm stumped....I guess the parse buffer error above means that there's
no buffer to parse?
Thanks again,
-Phil
At 12:46 PM 12/19/2002, you wrote:
Phil Forrest wrote:
Hello All,
Once upon a time, I had a happy ganglia monitor that was giving me
valuable data on all nodes of my 48 node cluster. Then I got a
request from a user to upgrade the kernel. After I upgraded the
kernels across the cluster, my ganglia could only see the data from
the gmond running on the head node (which also had gmetad and httpd
running).
The cluster is running Red Hat 7.3 with kernel 2.4.9-34smp #1 SMP Sat
Jun 1 05:54:57 EDT 2002 i686 unknown
My cluster has 46 compute nodes with one interface (eth0) and two
head nodes with two interfaces (eth0 and eth1), one for the private
LAN and one for the campus network. My head node that has gmetad
running has "mcast_if eth1" set in its gmond.conf file. Here's the
/sbin/ifconfig slice for eth1 on the head node:
eth1  Link encap:Ethernet  HWaddr 00:40:F4:2A:6E:26
      inet addr:192.168.5.200  Bcast:192.168.5.255  Mask:255.255.255.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:176581970 errors:0 dropped:0 overruns:0 frame:0
      TX packets:160905314 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0
      RX bytes:1187468116 (1132.4 Mb)  TX bytes:2350492219 (2241.6 Mb)
Can I trust the output of /sbin/ifconfig? (Meaning: if /sbin/ifconfig
says MULTICAST, is that the REAL truth, or can the kernel still
suppress multicast traffic?)
The kernel's firewalling configuration can still filter out multicast
traffic. Check your firewall config (man iptables :) ). If your
config is very restrictive, at least poke a li'l hole for the
multicast IP/port combo.
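Under iptables, that hole would look something like this (239.2.11.71
and 8649 are gmond's defaults; match them to whatever your gmond.conf
uses):

```shell
# Accept gmond's multicast XDR traffic ahead of any restrictive rules:
iptables -I INPUT -d 239.2.11.71 -p udp --dport 8649 -j ACCEPT
# Multicast group membership is negotiated over IGMP, so let that in too:
iptables -I INPUT -p igmp -j ACCEPT
```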
IIRC, the default iptables behavior changed a few point releases back
in Red Hat - it's now on by default. This is apparently to keep everyone
who's installing it on a desktop connected to the net via cable modem
from getting owned...
Also, gmetad cares not one whit about /etc/gmond.conf. I just did a
once-over on the code to make absolutely sure, there's no mention of
it. It's /etc/gmetad.conf that you should concern yourself with on the
head units if you're having display problems. Unless they're also
supposed to be part of the cluster, in which case you would configure
the gmonds separately.
Remember to open the firewall for TCP port 8649 on hosts running the
monitoring core, and TCP port 8651 on the hosts running gmetad.
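As a sketch (run as root; these insert permissive rules, so tighten the
source addresses to your cluster LAN if that matters to you):

```shell
# On every host running gmond: let gmetad poll the XML port.
iptables -I INPUT -p tcp --dport 8649 -j ACCEPT
# On the host(s) running gmetad: let clients reach its XML port.
iptables -I INPUT -p tcp --dport 8651 -j ACCEPT
```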
The metadaemon should be determining the path to establish its
connections via the good ol' fashioned kernel routing table, just like
anything else.
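So a quick sanity check is just the routing table itself (192.168.5.0/24
is the private LAN from earlier in this thread; eth1 is the interface
the head node uses for it):

```shell
# Is there a route to the compute-node LAN, and via the expected interface?
route -n | grep 192.168.5
```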
As a test, I've been running gmond on one node in deaf debug mode,
and on another node in mute debug mode. The deaf one is pumping out
data successfully and the mute one is not seeing anything. Since this
is compute node to compute node, there can only be one interface
(eth0). There has to be something in the kernel config that is
screwing this up.
That sounds like it's a firewall config issue or a router/switch
config issue to me...
I'm wondering with all the kernel upgrades going on out there, maybe
someone has had similar issues? Thanks in advance for any info!
7.2 / 2.4.19smp on most of our nodes here, no reported problems with
the monitoring core on any of them.
Happy Holidays To All,
-Phil Forrest
Yeah, happy Life Day, kids. ;)
Hope this info proves useful...
Phil Forrest
334-844-6910
Auburn University Dept. of Physics
Network & Scientific Computing
207 Leach Science Center
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general