When in doubt, use telnet.
See what telnetting to node02 from prism (which I assume is one of the head
systems) on port 8649 gets you. You should at the very least get a Ganglia
DTD and XML for one node. If you don't, something is really wrong
(congratulations, you found a weird bug). If you see data from only one
node, try the same thing on the other nodes - poll each of them on that
port and see whether you get *their* node information back.
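For example (assuming node02 resolves from prism and gmond is on its
default port; nc works too if telnet isn't handy):

```shell
# Poll gmond's XML port on node02; expect the Ganglia DTD followed by
# <GANGLIA_XML ...> containing a <HOST ...> block per node it knows about.
telnet node02 8649

# Or, non-interactively, dump the XML and count the hosts reported:
nc node02 8649 | grep -c '<HOST '
```

Repeat that against each node and compare - whoever reports the fewest
hosts is the one not hearing the multicast traffic.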
If, for example, node 1 only has data about itself, whereas nodes 2-4 have
information on all nodes except for node 1 ... well, you can probably
figure out where the disconnect is there. ;)
If all the nodes only have data about themselves, that suggests either a
misconfiguration (unlikely, since it was all working with the same config
before you upgraded ... *cough* right? :) ), a kernel issue (although
nothing in the kernel config or build flags springs to mind), or a
network issue. Try pinging the multicast IP (you should get a
response from every node that's listening on that IP) and your old friends,
tcpdump and a really long cat5 cable. As you've said, the XDR packets are
being *sent* by at least one of your nodes - you're watching them go out.
So, you run tcpdump on that host and make sure they actually make it out at
the kernel level. Then you go one hop down on the network and run tcpdump
again. You should still see them. Keep going one spigot at a time until
you've traversed the path from one node to another (this may be a short
trip if they're on the same switch of course :) ).
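Concretely, something like this (239.2.11.71:8649 are gmond's
compiled-in multicast defaults; substitute whatever your gmond.conf
actually uses):

```shell
# Every gmond subscribed to the multicast channel should answer this:
ping -c 3 239.2.11.71

# On a sending node: confirm the XDR packets actually leave eth0 ...
tcpdump -i eth0 -n udp port 8649

# ... and on a node that should be hearing them: confirm they arrive.
tcpdump -i eth0 -n dst host 239.2.11.71 and udp port 8649
```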
Also, remember that gmonds have a "trusted host" config directive and that
they will accept and immediately close connections from hosts *not* in that
list. To gmetad this will look like a successful connection with no data.
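From memory, the directive in the flat /etc/gmond.conf looks like the
following (the head-node IP here is taken from later in this thread;
adjust to your own setup):

```
# Hosts allowed to connect to gmond's XML port (8649). The host running
# gmetad must be listed, or its connections get accepted and immediately
# closed - which gmetad sees as "no element found".
trusted_hosts 127.0.0.1 192.168.5.200
```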
Phil Forrest wrote:
Thanks for the info Steve...I think I've found some part of the problem.
I pointed gmetad to the gmond on node02 instead of the local gmond.
Here's what I get:
(NOTE: prior to running this, node02 is running gmond in debug mode
level 2 and I am seeing data on node02 that gmond is putting to stdout)
[EMAIL PROTECTED] etc]# /usr/sbin/gmetad
Setting debug level to 2
Datasource = [Node02]
Trying to connect to 192.168.5.2:8649 for [Node02]
Data inserted for [Node02] into sources hash
Going to run as user nobody
Sources are ...
Source: [Node02] has 1 sources
192.168.5.2
listening on port 8651
3076 is monitoring [Node02] data source
192.168.5.2
save_to_rrd() XML_ParseBuffer() error at line 1:
no element found
data_thread() couldn't parse the XML and data to RRD for [Node02]
[Node02] is a dead source
save_to_rrd() XML_ParseBuffer() error at line 1:
no element found
Is this saying that there is no data coming from node02, or that it
cannot save to the round-robin database for some reason?
Additionally, I checked the firewall config on the internal compute
nodes. I'm no expert on iptables, but it looks like there are NO rules:
[EMAIL PROTECTED] etc]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
[EMAIL PROTECTED] etc]# ipchains -L
ipchains: Incompatible with this kernel
[EMAIL PROTECTED] etc]#
Note that node04 had the exact same output for iptables -L.
Additionally, I set hosts.allow to ALL:ALL on node02 and node04 then
repeated the test where node02 was running in deaf debug mode and node04
was running in mute debug mode. Node04 never received any data. It just
kept calling the cleanup thread.
I'm stumped....I guess the parse buffer error above means that there's
no buffer to parse?
Thanks again,
-Phil
At 12:46 PM 12/19/2002, you wrote:
Phil Forrest wrote:
Hello All,
Once upon a time, I had a happy ganglia monitor that was giving me
valuable data on all nodes of my 48 node cluster. Then I got a
request from a user to upgrade the kernel. After I upgraded the
kernels across the cluster, my ganglia could only see the data from
the gmond running on the head node (which also had gmetad and httpd
running).
The cluster is running Red Hat 7.3 with kernel 2.4.9-34smp #1 SMP Sat
Jun 1 05:54:57 EDT 2002 i686 unknown
My cluster has 46 compute nodes with one interface (eth0) and two
head nodes with two interfaces (eth0 and eth1), one for the private
LAN and one for the campus network. My head node that has gmetad
running has "mcast_if eth1" set in its gmond.conf file. Here's the
/sbin/ifconfig slice for eth1 on the head node:
eth1  Link encap:Ethernet  HWaddr 00:40:F4:2A:6E:26
      inet addr:192.168.5.200  Bcast:192.168.5.255  Mask:255.255.255.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:176581970 errors:0 dropped:0 overruns:0 frame:0
      TX packets:160905314 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0
      RX bytes:1187468116 (1132.4 Mb)  TX bytes:2350492219 (2241.6 Mb)
Can I trust the output of /sbin/ifconfig? (Meaning: if /sbin/ifconfig
says MULTICAST, is that the REAL truth, or can the kernel still
suppress multicast traffic?)
The kernel's firewalling configuration can still filter out multicast
traffic. Check your firewall config (man iptables :) ). If your
config is very restrictive, at least poke a li'l hole for the
multicast IP/port combo.
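Under iptables, that hole would look something like this (239.2.11.71
and 8649 are gmond's defaults; match them to whatever your gmond.conf
uses):

```shell
# Accept gmond's multicast XDR traffic ahead of any restrictive rules:
iptables -I INPUT -d 239.2.11.71 -p udp --dport 8649 -j ACCEPT
# Multicast group membership is negotiated over IGMP, so let that in too:
iptables -I INPUT -p igmp -j ACCEPT
```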
IIRC, the default iptables behavior changed a few point releases back
in Red Hat - it's now on by default. This is apparently to keep everyone
who's installing it on a desktop connected to the net via cable modem
from getting owned...
Also, gmetad cares not one whit about /etc/gmond.conf. I just did a
once-over on the code to make absolutely sure, there's no mention of
it. It's /etc/gmetad.conf that you should concern yourself with on the
head units if you're having display problems. Unless they're also
supposed to be part of the cluster, in which case you would configure
the gmonds separately.
Remember to open the firewall for TCP port 8649 on hosts running the
monitoring core, and TCP port 8651 on the hosts running gmetad.
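As a sketch (run as root; these insert permissive rules, so tighten the
source addresses to your cluster LAN if that matters to you):

```shell
# On every host running gmond: let gmetad poll the XML port.
iptables -I INPUT -p tcp --dport 8649 -j ACCEPT
# On the host(s) running gmetad: let clients reach its XML port.
iptables -I INPUT -p tcp --dport 8651 -j ACCEPT
```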
The metadaemon should be determining the path to establish its
connections via the good ol' fashioned kernel routing table, just like
anything else.
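So a quick sanity check is just the routing table itself (192.168.5.0/24
is the private LAN from earlier in this thread; eth1 is the interface
the head node uses for it):

```shell
# Is there a route to the compute-node LAN, and via the expected interface?
route -n | grep 192.168.5
```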
As a test, I've been running gmond on one node in deaf debug mode,
and on another node in mute debug mode. The deaf one is pumping out
data successfully and the mute one is not seeing anything. Since this
is compute node to compute node, there can only be one interface
(eth0). There has to be something in the kernel config that is
screwing this up.
That sounds like it's a firewall config issue or a router/switch
config issue to me...
I'm wondering with all the kernel upgrades going on out there, maybe
someone has had similar issues? Thanks in advance for any info!
7.2 / 2.4.19smp on most of our nodes here, no reported problems with
the monitoring core on any of them.
Happy Holidays To All,
-Phil Forrest
Yeah, happy Life Day, kids. ;)
Hope this info proves useful...
Phil Forrest
334-844-6910
Auburn University Dept. of Physics
Network & Scientific Computing
207 Leach Science Center
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general