RE: [Ganglia-general] Kernel Upgrade killed my happy Ganglia

Kent IV, William (WW) Thu, 19 Dec 2002 14:30:45 -0800

I've got an almost identical problem, except I'm using a 3Com 3924 GigE switch 
(and Dlink DGE-500T adapters).  Also, the motherboards have on-board 10/100 
connections that aren't being used.


The switches appear to work fine, because if I go back to the 10/100 adapters I 
see all the nodes.  No recompile or reconfig of Ganglia needed, just the 
reconfig of the network to use the other adapter.  

I have configured gmond.conf to use the appropriate interface for multicast, 
but haven't messed with the mcast_ttl.  If it were the switch, I'd expect it to 
fail under 10/100 as well as 1000 (and it doesn't).

I've always thought it was a problem with the Dlink adapters and/or their 
drivers.  Maybe I need to pay for some support from RedHat?  I'll try Steven's 
suggestions as well, but probably not until after the holidays.
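
For anyone who wants to script the telnet check Steven describes in the quoted message below, here is a rough sketch. It connects to each node on port 8649, reads whatever XML gmond sends back, and lists the hosts that node knows about. It assumes each host appears as a HOST element with a NAME attribute; check that against a real dump and adjust if yours looks different. The function names are my own, not part of Ganglia.

```python
import socket
import sys
import xml.etree.ElementTree as ET

GMOND_PORT = 8649  # port used for the telnet check in this thread


def fetch_gmond_xml(host, port=GMOND_PORT, timeout=5.0):
    """Connect the way telnet would and read until gmond closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_in_dump(xml_bytes):
    """Return the sorted NAME attribute of every HOST element in the dump.

    Assumption: hosts are reported as <HOST NAME="..."> elements.
    """
    if not xml_bytes.strip():
        # A connection that is accepted and closed with no data ends up here
        # (the "trusted host" case described in the quoted message).
        return []
    root = ET.fromstring(xml_bytes)
    return sorted(h.get("NAME", "?") for h in root.iter("HOST"))


if __name__ == "__main__":
    # Usage: python gmond_hosts.py node01 node02 ...   (script name is hypothetical)
    for node in sys.argv[1:]:
        try:
            names = hosts_in_dump(fetch_gmond_xml(node))
        except (OSError, ET.ParseError) as exc:
            print(f"{node}: failed ({exc})")
            continue
        print(f"{node}: reports {len(names)} host(s): {', '.join(names)}")
```

If every node reports only itself, that matches what Phil is seeing and points at multicast delivery rather than at gmetad.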

Bill Kent
Dow Chemical
[EMAIL PROTECTED]

-----Original Message-----
From: Steven Wagner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 19, 2002 4:49 PM
To: Phil Forrest
Cc: [email protected]
Subject: Re: [Ganglia-general] Kernel Upgrade killed my happy Ganglia


I'd check the network equipment if I were you.  The specifics of that are 
of course vendor-dependent (HP makes Gig-E switches?  What's next, gaming 
consoles?).  Make sure it hasn't been configured to drop multicast traffic 
or something (it could happen!).

Oh yeah, and try increasing mcast_ttl on one of the systems by at least 1.
Try 3, then restart the monitoring core, and see if the other nodes pick up
its multicast data.  Might wanna do that first, since it's easy and doesn't
require a console cable. :)
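
If you want to test multicast delivery and the TTL independently of gmond, something like the following might help. It is only a sketch: it uses the 239.2.11.71 address from this thread, but the port (18649) is a made-up spare port so the probe does not feed junk to gmond, and the script and function names are hypothetical. One side joins the group on a chosen interface and prints who it hears from; the other sends a single datagram with a chosen TTL out of a chosen interface.

```python
import socket
import struct
import sys

GROUP = "239.2.11.71"  # multicast address pinged in this thread
TEST_PORT = 18649      # hypothetical spare port, chosen to stay out of gmond's way


def membership_request(group, iface_ip):
    """Pack an ip_mreq: the group address followed by the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def listen(iface_ip):
    """Join GROUP on the interface that owns iface_ip and print each sender."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", TEST_PORT))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(GROUP, iface_ip))
    while True:
        data, (src, _port) = sock.recvfrom(1500)
        print(f"got {len(data)} bytes from {src}")


def send(iface_ip, ttl):
    """Send one probe datagram out of the chosen interface with the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("B", ttl))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(iface_ip))
    sock.sendto(b"mcast probe", (GROUP, TEST_PORT))
    sock.close()


if __name__ == "__main__" and len(sys.argv) >= 3:
    mode, iface_ip = sys.argv[1], sys.argv[2]
    if mode == "listen":
        listen(iface_ip)
    elif mode == "send":
        send(iface_ip, int(sys.argv[3]) if len(sys.argv) > 3 else 1)
```

For example, run "python mcast_probe.py listen 192.168.5.2" on a compute node and "python mcast_probe.py send 192.168.5.200 1" on the head node, then repeat the send with 3. If the listener never prints anything at either TTL, the traffic is being lost between the two interfaces (switch, adapter, or driver) and the TTL is not the culprit.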

Anyway, that's about it for me today.  Back to debugging X...

Phil Forrest wrote:
> Steve & Lester,
> 
> Thanks for your help. I think I'm getting somewhere, but I'm not sure 
> where ;)
> I DO get data when I telnet to a particular node, but I get data for 
> ONLY that node, and no other node. I don't know squat about multicast, 
> and that may be my undoing here. But the ping results are interesting 
> from a compute node and one head node:
> 
> 
> [EMAIL PROTECTED] etc]# ping 239.2.11.71
> PING 239.2.11.71 (239.2.11.71) from 192.168.5.2 : 56(84) bytes of data.
> 64 bytes from 192.168.5.2: icmp_seq=0 ttl=255 time=577 usec
> 64 bytes from 192.168.5.2: icmp_seq=1 ttl=255 time=87 usec
> 64 bytes from 192.168.5.2: icmp_seq=2 ttl=255 time=60 usec
> 
> [EMAIL PROTECTED] root]# ping 239.2.11.71
> PING 239.2.11.71 (239.2.11.71) from 192.168.5.38 : 56(84) bytes of data.
> 64 bytes from 192.168.5.38: icmp_seq=0 ttl=255 time=320 usec
> 64 bytes from 192.168.5.38: icmp_seq=1 ttl=255 time=12 usec
> 64 bytes from 192.168.5.38: icmp_seq=2 ttl=255 time=10 usec
> 64 bytes from 192.168.5.38: icmp_seq=3 ttl=255 time=11 usec
> 
> [EMAIL PROTECTED] etc]# ping -I 192.168.5.200 239.2.11.71
> PING 239.2.11.71 (239.2.11.71) from 192.168.5.200 : 56(84) bytes of data.
> 64 bytes from 192.168.5.200: icmp_seq=0 ttl=255 time=81 usec
> 64 bytes from 192.168.5.200: icmp_seq=1 ttl=255 time=41 usec
> 64 bytes from 192.168.5.200: icmp_seq=2 ttl=255 time=31 usec
> 64 bytes from 192.168.5.200: icmp_seq=3 ttl=255 time=29 usec
> 64 bytes from 192.168.5.200: icmp_seq=4 ttl=255 time=25 usec
> 
> So, this may be it - I only get a reply from the host that I am on when 
> I ping the multicast address.
> This may be revealing for the head node....apparently ping is defaulting 
> to the public IP on the eth0 interface for prism. I had to tell it to 
> look at the internal LAN. If I indeed used the ping command correctly, 
> this suggests something wrong with multicast. We have an HP Procurve 
> Gigabit switch as the only switch between the compute nodes and the two 
> head nodes. I haven't messed with any settings on that device since the 
> vendor installed it.
> 
> Just for fun, I told gmetad to point to another source (a compute node) 
> for data. The same thing happens - i.e., only the data for that 
> particular node is shown. What other configuration problems could lead 
> to this?
> 
> If I pull any more hair out, I'm gonna look like Kojak!
> 
> Thanks again for the help,
> -Phil
> 
> At 02:31 PM 12/19/2002, you wrote:
> 
>> When in doubt, use telnet.
>>
>> See what telnetting to node02 from prism (which I assume is one of the 
>> head systems) on port 8649 gets you.  You should at the very least get 
>> a Ganglia DTD and XML for one node.  If you don't, something is really 
>> wrong (congratulations, you found a weird bug).  If you see data from 
>> only one node, try the same thing on the other nodes - poll the others 
>> on that port and see if you see *their* node information.
>>
>> If, for example, node 1 only has data about itself, whereas nodes 2-4 
>> have information on all nodes except for node 1 ... well, you can 
>> probably figure out where the disconnect is there. ;)
>>
>> If all the nodes only have data about themselves, that suggests either 
>> a misconfiguration (unlikely, since it was all working with the same 
>> config before you upgraded ... *cough*  right?  :) ), a kernel issue 
>> (although I can't think of anything off the top of my head in the 
>> kernel config or any kflags) or a network issue.  Try pinging the 
>> multicast IP (you should get a response from every node that's 
>> listening on that IP) and your old friends, tcpdump and a really long 
>> cat5 cable.  As you've said, the XDR packets are being *sent* by at 
>> least one of your nodes - you're watching them go out. So, you run 
>> tcpdump on that host and make sure they actually make it out at the 
>> kernel level.  Then you go one hop down on the network and run tcpdump 
>> again.  You should still see them.  Keep going one spigot at a time 
>> until you've traversed the path from one node to another (this may be 
>> a short trip if they're on the same switch of course :) ).
>>
>> Also, remember that gmonds have a "trusted host" config directive and 
>> that they will accept and immediately close connections from hosts 
>> *not* in that list.  To gmetad this will look like a successful 
>> connection with no data.
> 
> 
> Phil Forrest
> 334-844-6910
> Auburn University Dept. of Physics
> Network & Scientific Computing
> 207 Leach Science Center
> 
> 
> 
> -------------------------------------------------------
> This SF.NET email is sponsored by: Geek Gift Procrastinating?
> Get the perfect geek gift now!  Before the Holidays pass you by.
> T H I N K G E E K . C O M      http://www.thinkgeek.com/sf/
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general



