Thanks Matt... I don't really understand the issue myself, being a UNIX person and not a network person. They spoke of routers, turning on multicast routing, new tables, and the inability of certain classes of addresses to receive multicast traffic.

But basically they're no longer unhappy with me. What you say makes sense to me too, so I don't really understand their issue.

The real benefit to me, however, is that when a node is removed from the cluster (for repair or retirement) I only need to restart the gmond daemon on the two listener machines, rather than the gmond daemons on all the nodes as I had to do when they all listened.

Thanks,
Paul


Matt Massie wrote:

paul-

i'm a little confused here.  if you run all your cluster hosts in "deaf"
mode except for two hosts, then the amount of multicast traffic would
not change.

with your current configuration every host is still multicasting on the
channel (since they are not "mute") but only two hosts are listening and
saving the data on the multicast channel.

the only thing you save with your configuration is some memory on each
of your cluster nodes (since they are not storing the data).

a 160 node cluster should only use about 65 kb/s of network bandwidth.  what they might have been seeing is a spike that can occur when lots of hosts crash and then rejoin the group, or when you reboot the entire cluster.  traffic increases for about five minutes or so as each node syncs with the rest of the cluster.
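(as a rough sanity check, and assuming an illustrative figure of ~50 bytes of multicast metric data per host per second rather than anything measured: 160 hosts * 50 bytes * 8 bits = 64,000 bits/s, i.e. roughly 65 kb/s.)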

if you are really having problems with multicast, you might try the
2.6.0 beta which supports unicast UDP.  in that case, all nodes in a
cluster send their messages directly to one (or a few) hosts... which
is what i think you are trying to accomplish.
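
for illustration only, a unicast setup would look something like the sketch below.  the directive names are an assumption about the new configuration format and may not match the 2.6.0 beta exactly, and 192.168.3.2 is just a placeholder for your collector host:

  /* on every node: send metrics straight to the collector host
     (hypothetical sketch -- check the beta's sample gmond.conf) */
  udp_send_channel {
    host = 192.168.3.2
    port = 8649
  }

  /* only on the one or two collector hosts: accept the unicast packets */
  udp_recv_channel {
    port = 8649
  }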

-matt



On Fri, 2004-06-04 at 10:51, Paul Henderson wrote:
We don't really consider the ganglia monitoring critical, so we can sustain what would be an exceedingly rare failure like both nodes going down.

The real reason I did this was to reduce multicast traffic... the network guys were getting blue in the face talking about how 160 nodes were all broadcasting and listening at the same time. I don't understand the actual mechanics, but they are now happy - or should I say "marginally happier"... they never seem to really be happy ;-)

Paul
Princeton Plasma Physics Lab

Bernard Li wrote:

Hey Paul:

But I guess in the odd chance of both of those nodes going down, your history will be lost...

Of course if you are using Ganglia on a large cluster, you probably
don't want every node to be sending packets to each other ;-)

Cheers,

Bernard


-----Original Message-----
From: Paul Henderson [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 04, 2004 10:38
To: Johnston Michael J Contr AFRL/DES
Cc: Bernard Li; [email protected]
Subject: Re: [Ganglia-general] All my nodes listed as clusters

What I've been doing is running gmond on all my cluster nodes, but making all but 2 of my 160 nodes "deaf" (see gmond.conf). All the nodes then multicast their information, but only the two listeners hold the data; the other nodes just broadcast and don't store anything.
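
For reference, the relevant gmond.conf attributes are just one-line settings. A rough sketch of what I mean (the cluster name and multicast channel below are placeholders, and the exact attribute names and defaults may vary with your gmond version):

  # on the 158 "deaf" worker nodes: send metrics, but don't listen or store
  name  "MyCluster"
  mcast_channel  239.2.11.71
  mcast_port  8649
  deaf  on
  mute  off

  # on the two listener nodes the only change is:
  deaf  off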

This is *really* useful, because if one node dies or is moved, then you don't have to restart gmond on every single node to get it to 'forget' the node... you just need to do it on the two listening nodes. Also, network traffic is significantly reduced.

Paul
Princeton Plasma Physics Lab

Johnston Michael J Contr AFRL/DES wrote:

Thanks for the response Bernard!

I guess I didn't think that I could put only one node in the data_source line, because how does it know to go and collect the information from the other nodes? Does it just scan the subnet looking for any machine running gmond? Every one of my nodes has the exact same gmond.conf file on it with the name of my cluster in it. Is that how it knows?

Thanks for asking about the graphs... Thanks to everyone's pointers, I learned that I had listed the path to the RRDtool directory, but hadn't put the executable name into the path. After I changed that, it all started working... ;) Ganglia is really awesome!

Mike


------------------------------------------------------------------------

*From:* Bernard Li [mailto:[EMAIL PROTECTED]]
*Sent:* Friday, June 04, 2004 11:18 AM
*To:* Johnston Michael J Contr AFRL/DES; [email protected]
*Subject:* RE: [Ganglia-general] All my nodes listed as clusters

If you only have one cluster, you only need one data_source (think of the data_source as the headnode of your cluster, if you will).

So you just need one entry for data_source - you can put more than one node in the data_source entry for redundancy purposes.
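
For example, the two entries from your gmetad.conf could be collapsed into a single data_source line, with the second host polled only if the first is unreachable (the 60 is the polling interval in seconds):

  data_source "MyCluster" 60 192.168.3.2:8649 192.168.3.3:8649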

So I take it you can see your graph now and the previous thread you posted is dead?

Cheers,

Bernard

------------------------------------------------------------------------

  *From:* [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] *On Behalf Of* Johnston Michael J Contr AFRL/DES
  *Sent:* Friday, June 04, 2004 8:37
  *To:* [email protected]
  *Subject:* [Ganglia-general] All my nodes listed as clusters

  I have a silly question, as usual...

  When I bring up the view of my cluster, it comes up as a Grid... so it looks like this:

  Grid > MyCluster > Choose a Node

  I'm guessing that's because in my gmetad.conf file I have every
  node in my cluster listed as:

  data_source "N1" 60 192.168.3.2:8649

  data_source "N2" 60 192.168.3.3:8649

  I'm sure that I'm listing them wrong because Ganglia thinks that each node is its own cluster. My question is how do I make them appear like one unit as I see in the demo pages? Do I add them all to one data_source line?

  On a side question, is it normal for my head node to always be in the red? It looks like it's only using about 8% CPU, but it's always red or orange.






--
Paul Henderson
UNIX Systems Engineering Group
Princeton Plasma Physics Laboratory
Princeton, NJ 08543
(609) 243-2412



