Steve & Lester,

Thanks for your help. I think I'm getting somewhere, but I'm not sure where ;)
I DO get data when I telnet to a particular node, but I get data for ONLY that node, and no other node. I don't know squat about multicast, and that may be my undoing here. But the ping results are interesting from a compute node and one head node:


[EMAIL PROTECTED] etc]# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.2 : 56(84) bytes of data.
64 bytes from 192.168.5.2: icmp_seq=0 ttl=255 time=577 usec
64 bytes from 192.168.5.2: icmp_seq=1 ttl=255 time=87 usec
64 bytes from 192.168.5.2: icmp_seq=2 ttl=255 time=60 usec

[EMAIL PROTECTED] root]# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.38 : 56(84) bytes of data.
64 bytes from 192.168.5.38: icmp_seq=0 ttl=255 time=320 usec
64 bytes from 192.168.5.38: icmp_seq=1 ttl=255 time=12 usec
64 bytes from 192.168.5.38: icmp_seq=2 ttl=255 time=10 usec
64 bytes from 192.168.5.38: icmp_seq=3 ttl=255 time=11 usec

[EMAIL PROTECTED] etc]# ping -I 192.168.5.200 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.200 : 56(84) bytes of data.
64 bytes from 192.168.5.200: icmp_seq=0 ttl=255 time=81 usec
64 bytes from 192.168.5.200: icmp_seq=1 ttl=255 time=41 usec
64 bytes from 192.168.5.200: icmp_seq=2 ttl=255 time=31 usec
64 bytes from 192.168.5.200: icmp_seq=3 ttl=255 time=29 usec
64 bytes from 192.168.5.200: icmp_seq=4 ttl=255 time=25 usec

So, this may be it: I only get a reply from the host that I am on when I ping the multicast address. The head node result may be revealing, too. Apparently ping defaults to the public IP on the eth0 interface for prism, so I had to point it at the internal LAN address with -I. If I did indeed use the ping command correctly, this suggests something is wrong with multicast. We have an HP Procurve Gigabit switch as the only switch between the compute nodes and the two head nodes. I haven't messed with any settings on that device since the vendor installed it.
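
For anyone wanting a second check besides ping, here is a rough Python sketch of a listener that joins the same multicast group on a chosen interface and reports which source addresses it hears. It is only a sketch under assumptions: the port (8649) is assumed to be the gmond multicast port, the function names are made up for illustration, and binding may fail if another process holds the port without SO_REUSEADDR.

```python
import socket
import time

GROUP = "239.2.11.71"  # multicast group from the ping tests above
PORT = 8649            # assumption: gmond multicasts on this port


def membership_request(group, iface_ip):
    """Pack an ip_mreq: the group address followed by the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def listen(iface_ip, seconds=30.0):
    """Join GROUP on iface_ip and return the set of source IPs heard within `seconds`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    senders = set()
    try:
        sock.bind(("", PORT))
        # Joining on an explicit interface avoids the kernel picking the public one.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(GROUP, iface_ip))
        deadline = time.monotonic() + seconds
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                _payload, (src, _src_port) = sock.recvfrom(65535)
            except socket.timeout:
                break
            senders.add(src)
    finally:
        sock.close()
    return senders


if __name__ == "__main__":
    # 192.168.5.200 is the internal address used with ping -I above.
    heard = listen("192.168.5.200")
    print("sources heard:", sorted(heard) if heard else "none")
```

If this only ever lists the local address, that would be consistent with the ping results, i.e. multicast traffic not making it between hosts.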

Just for fun, I told gmetad to point to another source (a compute node) for data. The same thing happens - i.e., only the data for that particular node is shown. What other configuration problems could lead to this?

If I pull any more hair out, I'm gonna look like Kojak!

Thanks again for the help,
-Phil

At 02:31 PM 12/19/2002, you wrote:
When in doubt, use telnet.

See what telnetting to node02 from prism (which I assume is one of the head systems) on port 8649 gets you. You should at the very least get a Ganglia DTD and XML for one node. If you don't, something is really wrong (congratulations, you found a weird bug). If you see data from only one node, try the same thing on the other nodes - poll the others on that port and see if you see *their* node information.

If, for example, node 1 only has data about itself, whereas nodes 2-4 have information on all nodes except for node 1 ... well, you can probably figure out where the disconnect is there. ;)
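
The node-by-node telnet check can be scripted. The sketch below is illustrative only: it assumes each gmond answers on TCP port 8649 with XML containing HOST elements that carry a NAME attribute, and the helper names and the node list are hypothetical. An empty reply is reported separately, since a connection that closes without sending data yields no XML at all.

```python
import socket
import xml.etree.ElementTree as ET

GMOND_PORT = 8649  # port named in the advice above


def fetch_xml(host, port=GMOND_PORT, timeout=5.0):
    """Read everything the remote end sends until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_in_xml(xml_bytes):
    """Return the NAME of every HOST element; empty input gives an empty set."""
    if not xml_bytes.strip():
        return set()
    root = ET.fromstring(xml_bytes)
    return {host.get("NAME") for host in root.iter("HOST")}


def report(nodes):
    """Print, for each node polled, which hosts it claims to know about."""
    for node in nodes:
        try:
            seen = hosts_in_xml(fetch_xml(node))
        except (OSError, ET.ParseError) as exc:
            print(f"{node}: failed ({exc})")
            continue
        if seen:
            print(f"{node}: knows about {sorted(seen)}")
        else:
            print(f"{node}: connected but got no host data")


if __name__ == "__main__":
    # Hypothetical node list; substitute the real hostnames.
    report(["node01", "node02", "node03", "node04"])
```

Comparing the lines of output shows quickly whether every node sees only itself or whether one node is cut off from the rest.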

If all the nodes only have data about themselves, that suggests either a misconfiguration (unlikely, since it was all working with the same config before you upgraded ... *cough* right? :) ), a kernel issue (although I can't think of anything off the top of my head in the kernel config or any kflags) or a network issue. Try pinging the multicast IP (you should get a response from every node that's listening on that IP) and your old friends, tcpdump and a really long cat5 cable. As you've said, the XDR packets are being *sent* by at least one of your nodes - you're watching them go out. So, you run tcpdump on that host and make sure they actually make it out at the kernel level. Then you go one hop down on the network and run tcpdump again. You should still see them. Keep going one spigot at a time until you've traversed the path from one node to another (this may be a short trip if they're on the same switch of course :) ).

Also, remember that gmonds have a "trusted host" config directive and that they will accept and immediately close connections from hosts *not* in that list. To gmetad this will look like a successful connection with no data.

Phil Forrest
334-844-6910
Auburn University Dept. of Physics
Network & Scientific Computing
207 Leach Science Center

