Hi folks, I'm no ganglia power user--I've always just set it up for the handful of machines sharing a switch and it Just Worked. My latest cluster is on someone else's network--that is to say, the group that owns the cluster doesn't own, manage, or have any access to the switches everything is plugged into. All they have is a bunch of cables running out of the wall, so I can't even see which machines are plugged into which networking equipment. There are "a bunch of Cisco 6500's" in the closet, says He Who Once Saw Inside. Must be a pretty big closet. Anyhow, the point is that I don't know how the machines are interconnected in that networking closet. They're all on bonded ethernet links, which work fine for talking to one another via unicast. They all run ganglia 3.1 on Debian Linux, and all are configured identically on the OS side, to point to the node running the webserver as their only "trusted_host" using interface bond0 as "mcast_if". The only other setting in the gmond.conf beyond name/owner/latlong/num_nodes is "all_trusted on".
So, all that lead-in aside, I have gmond running on a group of 30 machines, and only 20 of them see each other. The remaining 10 are in a separate rack, and of a newer stock--they probably showed up a year or more after the first 20. So hey, who knows, maybe they are on a different switch. And maybe the switches don't do multicast routing. All are on the same broadcast/unicast subnet, and none run firewalls.

So far I've tried a few things, none of which seemed to make any difference:

- gstat -a shows the 20 hosts.
- gstat -d showed the 10 down hosts, with varying "last report" times: some within the past 15 minutes, some only reporting a day or two before, at a time a varying number of minutes after their last boot.
- Restarted all gmetad/gmond on the cluster. The 20 hosts popped right back on; the 10 dead hosts have not reappeared as dead, yet.
- gmond on the "trusted_host", run with a debuglevel of 10, showed 0 mentions of these hosts by IP or name during 5 minutes of watching.
- telnet missing_host 8649 dumps a valid-looking ganglia XML file. Sometimes it has a mention of other, non-missing hosts, but that mention is always short (a line or two, versus nearly 40 for a full host status), and is never of all hosts, just one or two. It does not mention any of the *other* missing machines.
- gstat -a on a missing host shows only itself.
- gstat -d on a missing host shows nothing.
- telnet present_host 8649 does not report any of the missing hosts.

I'm constrained in reconfiguring the network--if it doesn't work now, it's not likely to start. I can ask, but I have no faith that anything will happen.

Does my guess that it might be a multicast routing thing make sense? If not, what further steps would you recommend I take? If so, how can I set ganglia up to monitor hosts by unicast, or report to the webserver node by unicast? I added the missing hosts to the data_source in gmetad.conf on the head node, to no avail...

Thanks in advance for any advice!
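P.S. In case anyone wants to repeat the "telnet host 8649" comparison without eyeballing XML, here is a rough Python sketch of the same check. It assumes gmond answers on TCP port 8649 with an XML dump containing HOST elements that have a NAME attribute and METRIC children (which is what I see in my dumps); the function names are just ones I made up for this example, not part of ganglia.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the full XML dump gmond sends on connect (same as `telnet host 8649`)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection once the dump is sent
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_in_dump(xml_bytes):
    """Map each HOST name in the dump to the number of METRIC entries it carries.

    A host with only a handful of metrics corresponds to the "line or two"
    partial entries described above; a fully reported host has many more.
    """
    root = ET.fromstring(xml_bytes)
    return {h.get("NAME"): len(h.findall("METRIC")) for h in root.iter("HOST")}


# Example use (hostname is a placeholder):
#   for name, n in sorted(hosts_in_dump(fetch_gmond_xml("missing_host")).items()):
#       print(name, n)
```

Running this against a present host and a missing host and diffing the two name lists should show exactly which nodes each gmond has heard from.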
_______________________________________________
Ganglia-general mailing list
https://lists.sourceforge.net/lists/listinfo/ganglia-general

