I'd check the network equipment if I were you. The specifics of that are
of course vendor-dependent (HP makes Gig-E switches? What's next, gaming
consoles?). Make sure it hasn't been configured to drop multicast traffic
or something (it could happen!).
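Before blaming the switch, it's worth confirming on the host side that gmond has actually joined the group. On Linux you can read /proc/net/igmp; a rough Python sketch of decoding it (the hex-group format here is assumed from 2.4-era kernels on little-endian x86, where the group address is printed as a host-byte-order hex word):

```python
import socket
import struct

def igmp_groups(text):
    """Parse the contents of /proc/net/igmp and return the set of
    multicast groups joined, as dotted-quad strings."""
    groups = set()
    for line in text.splitlines():
        fields = line.split()
        # Group lines start with an 8-digit hex address, e.g. "010000E0 ..."
        if fields and len(fields[0]) == 8 and \
                all(c in "0123456789ABCDEF" for c in fields[0]):
            # Undo the host-byte-order printing to get network-order bytes.
            raw = struct.pack("<I", int(fields[0], 16))
            groups.add(socket.inet_ntoa(raw))
    return groups

# On a node: igmp_groups(open("/proc/net/igmp").read())
# and check that 239.2.11.71 shows up on the cluster interface.
```

If the gmond group isn't listed there, no switch setting will save you - the daemon never joined it.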
Oh yeah, and try increasing mcast_ttl on one of the systems by at least 1.
Try 3, restart gmond on that host, and see if the other nodes pick up
its multicast data. Might wanna do that first, since it's easy and doesn't
require a console cable. :)
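For reference, the relevant knobs in a 2.x-era gmond.conf look roughly like this (directive names assumed from the gmond 2.5 sample config - adjust to whatever version you're running):

```
# /etc/gmond.conf (Ganglia 2.x-style directives)
mcast_channel  239.2.11.71   # multicast group gmond sends on and listens to
mcast_port     8649          # UDP port for multicast, TCP port for the XML dump
mcast_ttl      3             # raise from the default of 1 for testing
```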
Anyway, that's about it for me today. Back to debugging X...
Phil Forrest wrote:
Steve & Lester,
Thanks for your help. I think I'm getting somewhere, but I'm not sure
where ;)
I DO get data when I telnet to a particular node, but I get data for
ONLY that node, and no other node. I don't know squat about multicast,
and that may be my undoing here. But the ping results are interesting
from a compute node and one head node:
[EMAIL PROTECTED] etc]# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.2 : 56(84) bytes of data.
64 bytes from 192.168.5.2: icmp_seq=0 ttl=255 time=577 usec
64 bytes from 192.168.5.2: icmp_seq=1 ttl=255 time=87 usec
64 bytes from 192.168.5.2: icmp_seq=2 ttl=255 time=60 usec
[EMAIL PROTECTED] root]# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.38 : 56(84) bytes of data.
64 bytes from 192.168.5.38: icmp_seq=0 ttl=255 time=320 usec
64 bytes from 192.168.5.38: icmp_seq=1 ttl=255 time=12 usec
64 bytes from 192.168.5.38: icmp_seq=2 ttl=255 time=10 usec
64 bytes from 192.168.5.38: icmp_seq=3 ttl=255 time=11 usec
[EMAIL PROTECTED] etc]# ping -I 192.168.5.200 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 192.168.5.200 : 56(84) bytes of data.
64 bytes from 192.168.5.200: icmp_seq=0 ttl=255 time=81 usec
64 bytes from 192.168.5.200: icmp_seq=1 ttl=255 time=41 usec
64 bytes from 192.168.5.200: icmp_seq=2 ttl=255 time=31 usec
64 bytes from 192.168.5.200: icmp_seq=3 ttl=255 time=29 usec
64 bytes from 192.168.5.200: icmp_seq=4 ttl=255 time=25 usec
So, this may be it - I only get a reply from the host that I am on when
I ping the multicast address.
This may be revealing for the head node... apparently ping is defaulting
to the public IP on the eth0 interface for prism, so I had to point it at
the internal LAN with -I. If I indeed used the ping command correctly,
this suggests something is wrong with multicast. We have an HP ProCurve
Gigabit switch as the only switch between the compute nodes and the two
head nodes. I haven't changed any settings on that device since the
vendor installed it.
Just for fun, I told gmetad to point to another source (a compute node)
for data. The same thing happens - i.e., only the data for that
particular node is shown. What other configuration problems could lead
to this?
If I pull any more hair out, I'm gonna look like Kojak!
Thanks again for the help,
-Phil
At 02:31 PM 12/19/2002, you wrote:
When in doubt, use telnet.
See what telnetting to node02 from prism (which I assume is one of the
head systems) on port 8649 gets you. You should at the very least get
a Ganglia DTD and XML for one node. If you don't, something is really
wrong (congratulations, you found a weird bug). If you see data from
only one node, try the same thing on the other nodes - poll the others
on that port and see if you see *their* node information.
If, for example, node 1 only has data about itself, whereas nodes 2-4
have information on all nodes except for node 1 ... well, you can
probably figure out where the disconnect is there. ;)
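Polling every node's 8649 port by hand gets old fast; it's easy to script. A minimal sketch, assuming Python is available on a head node and that gmond's XML puts each machine in a HOST element with a NAME attribute (true of the 2.x output I've seen - verify against your own dump):

```python
import socket
import xml.etree.ElementTree as ET

def hosts_in_xml(xml_text):
    """Return the sorted HOST names found in a gmond XML dump."""
    root = ET.fromstring(xml_text)
    return sorted(h.get("NAME") for h in root.iter("HOST"))

def gmond_hosts(node, port=8649, timeout=5.0):
    """Connect to a gmond's XML port and return the hosts it knows about.
    gmond dumps its XML and closes the connection, so just read to EOF."""
    buf = b""
    with socket.create_connection((node, port), timeout=timeout) as s:
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            buf += chunk
    return hosts_in_xml(buf.decode("utf-8", "replace"))

# for n in ["node01", "node02", "node03"]:
#     print(n, gmond_hosts(n))
```

A node that prints only its own name is a node that isn't hearing anyone else's multicast.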
If all the nodes only have data about themselves, that suggests either
a misconfiguration (unlikely, since it was all working with the same
config before you upgraded ... *cough* right? :) ), a kernel issue
(although I can't think of anything off the top of my head in the
kernel config or any kflags) or a network issue. Try pinging the
multicast IP (you should get a response from every node that's
listening on that IP) and your old friends, tcpdump and a really long
cat5 cable. As you've said, the XDR packets are being *sent* by at
least one of your nodes - you're watching them go out. So, you run
tcpdump on that host and make sure they actually make it out at the
kernel level. Then you go one hop down on the network and run tcpdump
again. You should still see them. Keep going one spigot at a time
until you've traversed the path from one node to another (this may be
a short trip if they're on the same switch of course :) ).
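If dragging tcpdump around is a pain, a node can also just listen for the XDR packets itself. A rough sketch of joining the group with Python's socket module (group and port are from this thread; binding to INADDR_ANY for the interface is an assumption - on a multi-homed head node you'd substitute the internal LAN address):

```python
import socket

def mreq_for(group, iface="0.0.0.0"):
    """Build the ip_mreq structure for IP_ADD_MEMBERSHIP:
    4 bytes of multicast group, then 4 bytes of local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface)

def listen_mcast(group="239.2.11.71", port=8649, iface="0.0.0.0"):
    """Join the gmond multicast group and print who is actually sending."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                 mreq_for(group, iface))
    while True:
        data, (src, _) = s.recvfrom(65535)
        print("%s sent %d bytes" % (src, len(data)))
```

Run it on a node that claims to see nothing: if packets from the other nodes show up here but not in gmond, the network is fine and the problem is in the daemon; if nothing shows up, the packets are dying somewhere on the wire.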
Also, remember that gmond has a trusted_hosts config directive and
that it will accept and immediately close connections from hosts
*not* in that list. To gmetad this will look like a successful
connection with no data.
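If that directive is in play, it looks something like this in a 2.x gmond.conf (directive name assumed from the 2.5 sample config; the addresses here are just Phil's internal IPs as an example):

```
# /etc/gmond.conf
# Hosts allowed to connect to the XML (TCP 8649) port, in addition
# to localhost.  The host running gmetad must be listed here.
trusted_hosts  192.168.5.200 192.168.5.38
```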
Phil Forrest
334-844-6910
Auburn University Dept. of Physics
Network & Scientific Computing
207 Leach Science Center
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general