I'd check the network equipment if I were you. The specifics of that are
of course vendor-dependent (HP makes Gig-E switches? What's next, gaming
consoles?). Make sure it hasn't been configured to drop multicast traffic
or something (it could happen!).
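Before blaming the switch, it's worth confirming on the host side that gmond has actually joined the group. On Linux you can read /proc/net/igmp; a rough Python sketch of decoding it (the hex-group format here is assumed from 2.4-era kernels on little-endian x86, where the group address is printed as a host-byte-order hex word):

```python
import socket
import struct

def igmp_groups(text):
    """Parse the contents of /proc/net/igmp and return the set of
    multicast groups joined, as dotted-quad strings."""
    groups = set()
    for line in text.splitlines():
        fields = line.split()
        # Group lines start with an 8-digit hex address, e.g. "010000E0 ..."
        if fields and len(fields[0]) == 8 and \
                all(c in "0123456789ABCDEF" for c in fields[0]):
            # Undo the host-byte-order printing to get network-order bytes.
            raw = struct.pack("<I", int(fields[0], 16))
            groups.add(socket.inet_ntoa(raw))
    return groups

# On a node: igmp_groups(open("/proc/net/igmp").read())
# and check that 239.2.11.71 shows up on the cluster interface.
```

If the gmond group isn't listed there, no switch setting will save you - the daemon never joined it.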
Oh yeah, and try increasing mcast_ttl on one of the systems by at least 1.
Try 3, restart gmond on that host, and see if the other nodes pick up
its multicast data. Might wanna do that first, since it's easy and doesn't
require a console cable. :)
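For reference, the relevant knobs in a 2.x-era gmond.conf look roughly like this (directive names assumed from the gmond 2.5 sample config - adjust to whatever version you're running):

```
# /etc/gmond.conf (Ganglia 2.x-style directives)
mcast_channel  239.2.11.71   # multicast group gmond sends on and listens to
mcast_port     8649          # UDP port for multicast, TCP port for the XML dump
mcast_ttl      3             # raise from the default of 1 for testing
```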
Anyway, that's about it for me today. Back to debugging X...
Phil Forrest wrote:
Steve & Lester,
Thanks for your help. I think I'm getting somewhere, but I'm not sure
where ;)
I DO get data when I telnet to a particular node, but I get data for
ONLY that node, and no other node. I don't know squat about multicast,
and that may be my undoing here. But the ping results are interesting
from a compute node and one head node:
[EMAIL PROTECTED] etc]# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.2 : 56(84) bytes of data.
64 bytes from 192.168.5.2: icmp_seq=0 ttl=255 time=577 usec
64 bytes from 192.168.5.2: icmp_seq=1 ttl=255 time=87 usec
64 bytes from 192.168.5.2: icmp_seq=2 ttl=255 time=60 usec
[EMAIL PROTECTED] root]# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.38 : 56(84) bytes of data.
64 bytes from 192.168.5.38: icmp_seq=0 ttl=255 time=320 usec
64 bytes from 192.168.5.38: icmp_seq=1 ttl=255 time=12 usec
64 bytes from 192.168.5.38: icmp_seq=2 ttl=255 time=10 usec
64 bytes from 192.168.5.38: icmp_seq=3 ttl=255 time=11 usec
[EMAIL PROTECTED] etc]# ping -I 192.168.5.200 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.200 : 56(84) bytes of data.
64 bytes from 192.168.5.200: icmp_seq=0 ttl=255 time=81 usec
64 bytes from 192.168.5.200: icmp_seq=1 ttl=255 time=41 usec
64 bytes from 192.168.5.200: icmp_seq=2 ttl=255 time=31 usec
64 bytes from 192.168.5.200: icmp_seq=3 ttl=255 time=29 usec
64 bytes from 192.168.5.200: icmp_seq=4 ttl=255 time=25 usec
So, this may be it - I only get a reply from the host that I am on when
I ping the multicast address.
This may be revealing for the head node... apparently ping is defaulting
to the public IP on the eth0 interface for prism, so I had to point it at
the internal LAN with -I. If I indeed used the ping command correctly,
this suggests something is wrong with multicast. We have an HP ProCurve
Gigabit switch as the only switch between the compute nodes and the two
head nodes. I haven't changed any settings on that device since the
vendor installed it.
Just for fun, I told gmetad to point to another source (a compute node)
for data. The same thing happens - i.e., only the data for that
particular node is shown. What other configuration problems could lead
to this?
If I pull any more hair out, I'm gonna look like Kojak!
Thanks again for the help,
-Phil
At 02:31 PM 12/19/2002, you wrote:
When in doubt, use telnet.
See what telnetting to node02 from prism (which I assume is one of the
head systems) on port 8649 gets you. You should at the very least get
a Ganglia DTD and XML for one node. If you don't, something is really
wrong (congratulations, you found a weird bug). If you see data from
only one node, try the same thing on the other nodes - poll the others
on that port and see if you see *their* node information.
If, for example, node 1 only has data about itself, whereas nodes 2-4
have information on all nodes except for node 1 ... well, you can
probably figure out where the disconnect is there. ;)
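Polling every node's 8649 port by hand gets old fast; it's easy to script. A minimal sketch, assuming Python is available on a head node and that gmond's XML puts each machine in a HOST element with a NAME attribute (true of the 2.x output I've seen - verify against your own dump):

```python
import socket
import xml.etree.ElementTree as ET

def hosts_in_xml(xml_text):
    """Return the sorted HOST names found in a gmond XML dump."""
    root = ET.fromstring(xml_text)
    return sorted(h.get("NAME") for h in root.iter("HOST"))

def gmond_hosts(node, port=8649, timeout=5.0):
    """Connect to a gmond's XML port and return the hosts it knows about.
    gmond dumps its XML and closes the connection, so just read to EOF."""
    buf = b""
    with socket.create_connection((node, port), timeout=timeout) as s:
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            buf += chunk
    return hosts_in_xml(buf.decode("utf-8", "replace"))

# for n in ["node01", "node02", "node03"]:
#     print(n, gmond_hosts(n))
```

A node that prints only its own name is a node that isn't hearing anyone else's multicast.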
If all the nodes only have data about themselves, that suggests either
a misconfiguration (unlikely, since it was all working with the same
config before you upgraded ... *cough* right? :) ), a kernel issue
(although I can't think of anything off the top of my head in the
kernel config or any kflags) or a network issue. Try pinging the
multicast IP (you should get a response from every node that's
listening on that IP) and your old friends, tcpdump and a really long
cat5 cable. As you've said, the XDR packets are being *sent* by at
least one of your nodes - you're watching them go out. So, you run
tcpdump on that host and make sure they actually make it out at the
kernel level. Then you go one hop down on the network and run tcpdump
again. You should still see them. Keep going one spigot at a time
until you've traversed the path from one node to another (this may be
a short trip if they're on the same switch of course :) ).
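If dragging tcpdump around is a pain, a node can also just listen for the XDR packets itself. A rough sketch of joining the group with Python's socket module (group and port are from this thread; binding to INADDR_ANY for the interface is an assumption - on a multi-homed head node you'd substitute the internal LAN address):

```python
import socket

def mreq_for(group, iface="0.0.0.0"):
    """Build the ip_mreq structure for IP_ADD_MEMBERSHIP:
    4 bytes of multicast group, then 4 bytes of local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface)

def listen_mcast(group="239.2.11.71", port=8649, iface="0.0.0.0"):
    """Join the gmond multicast group and print who is actually sending."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                 mreq_for(group, iface))
    while True:
        data, (src, _) = s.recvfrom(65535)
        print("%s sent %d bytes" % (src, len(data)))
```

Run it on a node that claims to see nothing: if packets from the other nodes show up here but not in gmond, the network is fine and the problem is in the daemon; if nothing shows up, the packets are dying somewhere on the wire.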
Also, remember that gmond has a trusted_hosts config directive and
that it will accept and immediately close connections from hosts
*not* in that list. To gmetad this will look like a successful
connection with no data.
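If that directive is in play, it looks something like this in a 2.x gmond.conf (directive name assumed from the 2.5 sample config; the addresses here are just Phil's internal IPs as an example):

```
# /etc/gmond.conf
# Hosts allowed to connect to the XML (TCP 8649) port, in addition
# to localhost.  The host running gmetad must be listed here.
trusted_hosts  192.168.5.200 192.168.5.38
```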
Phil Forrest
334-844-6910
Auburn University Dept. of Physics
Network & Scientific Computing
207 Leach Science Center
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general