Hi folks,

I'm no ganglia power user--I've always just set it up for the handful
of machines sharing a switch and it Just Worked.  My latest cluster is
on someone else's network--that is to say, the group that owns the
cluster doesn't own/manage/have any access to the switches stuff is
plugged into.  All they have is a bunch of cables running out of the
wall, so I can't even see what machines are plugged into what
networking equipment.  There are "a bunch of cisco 6500's" in the
closet, says He Who Once Saw Inside.  Must be a pretty big closet.
Anyhow, point being that I don't know how the machines are
interconnected in that networking closet.
The machines are all on bonded Ethernet links, which work fine for talking
to one another via unicast.  They all run ganglia 3.1 on Debian Linux, and
all are configured identically on the OS side, to point to the node
running the webserver as their only "trusted_host" using interface
bond0 as "mcast_if".  The only other setting in the gmond.conf beyond
name/owner/latlong/num_nodes is "all_trusted on".

So, all that lead-in aside, I have gmond running on a group of 30
machines, and only 20 of them see each other.  The remaining 10 are in
a separate rack and are newer stock--they probably showed up a year or
more after the first 20.  So hey, who knows, maybe they are on a
different switch, and maybe the switches don't do multicast routing.
All 30 are on the same broadcast/unicast subnet, and none run firewalls.
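
A rough way to test the multicast guess directly, independent of ganglia,
is a small probe like the sketch below.  It assumes the cluster uses
ganglia's stock multicast group 239.2.11.71; the port 18649 is a made-up
test port picked to stay clear of gmond's 8649, and the script name in the
usage line is hypothetical.  The address passed on the command line is
assumed to be the host's own bond0 address.

```python
import socket
import sys

GROUP = "239.2.11.71"  # ganglia's stock multicast group (assumed in use here)
PORT = 18649           # hypothetical test port, kept away from gmond's 8649


def membership_request(group, local_ip):
    """Build the ip_mreq bytes: multicast group followed by local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def listen(local_ip):
    """Join the group on the given interface and print every probe that arrives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(GROUP, local_ip))
    while True:
        data, (sender, _) = sock.recvfrom(1024)
        print(f"{sender}: {data.decode(errors='replace')}")


def send(local_ip, ttl=1):
    """Send one probe to the group out of the given interface (TTL 1 stays on the subnet)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(local_ip))
    sock.sendto(f"probe from {socket.gethostname()}".encode(), (GROUP, PORT))
    sock.close()


if __name__ == "__main__":
    if len(sys.argv) == 3 and sys.argv[1] in ("listen", "send"):
        (listen if sys.argv[1] == "listen" else send)(sys.argv[2])
    else:
        print("usage: mcast_probe.py listen|send <local bond0 IP address>")
```

Running "listen" on one of the 20 visible hosts and "send" from one of the
10 missing ones (and then the reverse) would show whether multicast
crosses between the two racks.  A probe that never arrives would be
consistent with the switches not forwarding multicast between them, though
it would not prove that is the only problem.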

So far I've tried a few things, none of which seemed to make any difference:

- gstat -a shows the 20 hosts.
- gstat -d showed the 10 down hosts, with varying "last report" times:
  some within the past 15 minutes, others a day or two ago.  Each time
  corresponds to a varying number of minutes after that host's last
  boot.
- Restarted all gmetad/gmond on the cluster.  The 20 hosts popped right
  back on; the 10 dead hosts have not yet reappeared as dead.
- gmond on the "trusted_host", run with a debug level of 10, showed 0
  mentions of these hosts by IP or name during 5 minutes of watching.
- telnet missing_host 8649 dumps a valid-looking ganglia XML file.
  Sometimes it mentions other, non-missing hosts, but that mention is
  always short (a line or two, versus nearly 40 for a full host
  status), and it never covers all hosts, just one or two.  It does not
  mention any of the *other* missing machines.
- gstat -a on a missing host shows only itself.
- gstat -d on a missing host shows nothing.
- telnet present_host 8649 does not report any of the missing hosts.
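
The telnet checks above can be repeated across many hosts with a short
script.  This is only a sketch: it assumes gmond's XML dump uses HOST
elements with a NAME attribute and METRIC children, which matches the
dumps I've looked at, and the function names are my own.  It reports how
many metrics each host carries, which makes the "a line or two versus
nearly 40" difference easy to spot.

```python
import socket
import sys
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the whole XML dump that gmond writes on its TCP port and return it as bytes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:  # gmond closes the connection once the dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_per_host(xml_bytes):
    """Map each HOST NAME in the dump to the number of METRIC elements under it."""
    root = ET.fromstring(xml_bytes)
    return {h.get("NAME"): len(h.findall("METRIC")) for h in root.iter("HOST")}


if __name__ == "__main__":
    # Each command-line argument is a node to query, e.g. a missing and a present host.
    for node in sys.argv[1:]:
        counts = metrics_per_host(fetch_gmond_xml(node))
        print(f"{node} knows about {len(counts)} host(s):")
        for name, n in sorted(counts.items()):
            print(f"  {name}: {n} metrics")
```

Pointing it at one missing host and one present host side by side should
reproduce the picture described in the list: each side's dump leaves out
the other side, apart from the occasional one- or two-line mention.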

So I'm constrained in reconfiguring the network--if something doesn't
work now, it's not likely to start.  I can ask, but I have no faith that
anything will happen.  Does my guess that it might be a multicast
routing thing make sense?  If not, what further steps would you
recommend I take?  If so, how can I set ganglia up to monitor hosts by
unicast, or report to the webserver-node by unicast?  I added the
missing hosts to the data_source in gmetad.conf on the head node to no
avail...

Thanks in advance for any advice!

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
