Dear Matthias,
Thanks for taking the time to look at this problem.
> What happens if you go with the standard tcp_accept_channel section ?
That's how it was set up before POL opened port 8649 and changed
tcp_accept_channel for us. The POL cluster report page looked exactly the
same as it does now. Before the POL gmond was directly accessible I used an
SSH tunnel to deliver the Ganglia data to my Web frontend server. The data
source lines in my gmetad.conf looked like this:
data_source "POL's gmond through SSH tunnel" 65 localhost:8648
data_source "ESSC's gmond" 60 essc.host.name:8649
data_source "BAS's gmond through SSH tunnel" 70 localhost:8647
I asked POL to make the changes because I thought the problem was related to
having more than one data source coming from localhost (the end points of the
tunnels).
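For completeness, the tunnels were ordinary SSH local port forwards, set up
with something like the following (the user and host names here are
placeholders, as elsewhere in this message):

```shell
# Forward local port 8648 to gmond's port 8649 on the POL head node.
# -N: run no remote command, -f: background after authenticating,
# -L: local forward, local_port:target_host:target_port
ssh -f -N -L 8648:localhost:8649 user@pol.host.name
```

gmetad then reads the forwarded gmond as localhost:8648, as shown in the
first data_source line above.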
> Some :-) The XML output from POL clearly shows the nodes from the other
> clusters being part of POL.
Actually I don't think it does. I realise I should have explained the
situation more clearly at the beginning. Let me describe the three clusters
in a bit more detail so it is easier to see what should and shouldn't be
attributed to POL.
POL cluster nodes:
=================
node001.beowulf.cluster
node002.beowulf.cluster
:
node090.beowulf.cluster
file01.beowulf.cluster
file02.beowulf.cluster
nemo.beowulf.cluster
nemo2.beowulf.cluster
BAS cluster nodes:
=================
node001.beowulf.cluster
node002.beowulf.cluster
:
node032.beowulf.cluster
bslhadesws1.beowulf.cluster
bslhadesws2.beowulf.cluster
bslhadesws3.beowulf.cluster
bslhadesws4.beowulf.cluster
bslhadesws5.beowulf.cluster
quad001.beowulf.cluster
quad002.beowulf.cluster
quad003.beowulf.cluster
quad004.beowulf.cluster
master.beowulf.cluster
db01.beowulf.cluster
ESSC cluster nodes:
=================
node001.beowulf.cluster
node002.beowulf.cluster
:
node016.beowulf.cluster
node101.beowulf.cluster
node102.beowulf.cluster
node103.beowulf.cluster
node104.beowulf.cluster
master.beowulf.cluster
storage.beowulf.cluster
I checked the POL gmond XML data again today to verify that none of the other
clusters' nodes were listed. I also checked the load_one measurements for
every POL node against the correct values from POL's internal Ganglia
Web frontend. I found no evidence of incorrect nodes or load values in POL's
gmond XML data.
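In case anyone wants to repeat the check, it is easy to script. Here is a
minimal sketch that lists the HOST names under each gmond's XML; the sample
document is cut down and illustrative only, and in practice the real XML
would come from e.g. "telnet pol.host.name 8649":

```shell
# List the HOST names reported in a gmond XML dump read from stdin.
list_hosts() {
    grep -o 'HOST NAME="[^"]*"' | sed 's/HOST NAME="//; s/"$//' | sort
}

# Cut-down, illustrative sample of the XML a gmond serves on its TCP port.
sample='<GANGLIA_XML VERSION="3.0.3" SOURCE="gmond">
<CLUSTER NAME="NEMO cluster @ POL" LOCALTIME="0" OWNER="POL" LATLONG="" URL="">
<HOST NAME="node001.beowulf.cluster" IP="192.168.1.1" REPORTED="0"/>
<HOST NAME="nemo.beowulf.cluster" IP="192.168.1.2" REPORTED="0"/>
</CLUSTER>
</GANGLIA_XML>'

printf '%s\n' "$sample" | list_hosts
```

Comparing the resulting host list against the known node names for each
cluster is how I satisfied myself that nothing was out of place.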
> the question arises how do you separate the cluster into cluster-local
> domains ? In other words, you somehow need to ensure that only the nodes
> from POL talk to the gmonds running on POL.
I'm pretty sure that is the situation we have now.
> Can all nodes talk to each other directly ?
No. I can't think of a way that could possibly happen and I haven't found any
evidence for it in the gmond XML data. I have checked the gmond data from
all three clusters for evidence of nodes being mixed up. All three clusters
are behind their institutional firewalls and the BAS cluster data comes here
via an SSH tunnel (as POL's was too until recently). I should add that the
POL cluster report page is correct if I remove the other data sources from
gmetad.conf, which suggests to me that the problem is with my Web frontend
rather than POL's gmond.
> Maybe you want to consider
> using different mcast IPs for the different clusters?
>
> In principle, I would first simplify the gmond.conf files and then play
> with the mcast addresses. If that starts to work, then I would add the
> access control.
Thanks for the suggestions. Do you have any other ideas in the light of the
above?
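For the record, my understanding of the multicast suggestion is something
like the following, with one distinct multicast address per cluster (the
239.2.11.x addresses below are just examples, not what we actually use).
For the 2.5.7 gmonds at ESSC and BAS:

```
# gmond.conf (2.5.x syntax), e.g. on the ESSC nodes
mcast_channel  239.2.11.71
mcast_port     8649
```

and for the 3.0.3 gmond at POL:

```
/* gmond.conf (3.0.x syntax), on the POL nodes */
udp_send_channel {
  mcast_join = 239.2.11.72
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.72
  port = 8649
  bind = 239.2.11.72
}
```

Please correct me if I have misunderstood what you had in mind.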
Regards,
-Dan.
>
>
> Matthias
>
> > -Dan.
> >
> > On Wednesday 31 Oct 2007 19:09, Matthias Blankenhaus wrote:
> > > Dan,
> > >
> > > could you post the relevant snippets from gmond.conf from your cluster
> > > nodes ?
> > >
> > > What is the XML output from gmond on the POL cluster ?
> > >
> > > Thanx,
> > > Matthias
> > >
> > > On Wed, 31 Oct 2007, Dan Bretherton wrote:
> > > > Dear All,
> > > >
> > > > Here are some updates to the message I posted to the list yesterday:
> > > >
> > > > 1) The XML data from gmetad seems to be correct. I got this data
> > > > from "telnet localhost 8651". I can't see any incorrect nodes listed
> > > > under POL, so I now suspect the Web frontend rather than gmetad.
> > > >
> > > > 2) the data in /var/lib/ganglia/rrds seems to be correct. There are
> > > > no incorrect nodes listed in the directory for the POL cluster. This
> > > > also points the finger at the Web frontend.
> > > >
> > > > 3) I have tried out the latest versions of gmetad and the Web
> > > > frontend (3.0.5) with the latest version of rrdtool (1.2.23) on
> > > > another computer to make sure the problem is not being caused by a
> > > > bug that has been fixed. I found that the same problem occurs with
> > > > the latest versions so I have left the public server
> > > > (http://www.resc.reading.ac.uk/ganglia/) on ganglia version 3.0.3 and
> > > > rrdtool version 1.2.15.
> > > >
> > > > 4) The BAS cluster nodes actually have different IP addresses to the
> > > > POL nodes of the same name, so the IP addresses are not the cause of
> > > > the BAS nodes being listed in the POL cluster report.
> > > >
> > > > Regards,
> > > > -Dan.
> > > >
> > > > On Tuesday 30 Oct 2007, you wrote:
> > > > > Dear All,
> > > > >
> > > > > This is the first time I have posted to the list, but I have made
> > > > > good use of the archives on many occasions. Unfortunately I can't
> > > > > find anything in the archives to help with my current problem.
> > > > >
> > > > > I am monitoring a grid consisting of clusters at three institutions
> > > > > called POL, BAS and ESSC. The clusters are all from the same
> > > > > supplier and use the same convention for slave node IP addresses
> > > > > and host names. All the clusters are behind their own institutional
> > > > > firewalls. My Ganglia Web frontend is at the following address:
> > > > > http://www.resc.reading.ac.uk/ganglia/
> > > > >
> > > > > My problem is that the POL cluster report mixes up nodes from all
> > > > > three clusters. The POL cluster is listed as "NEMO cluster @ POL"
> > > > > on the grid report page of my Web frontend. There are three main
> > > > > problems with the POL cluster report:
> > > > > 1) Nodes at ESSC and BAS with names not found at POL usually show
> > > > > up as blank spaces on the POL cluster page unless they are down, in
> > > > > which case they are represented by the usual pink box
> > > > > 2) The load level colouring (and hence the positioning on the page)
> > > > > of nodes that have the same name as nodes in other clusters is
> > > > > often governed by the other clusters
> > > > > 3) The overview section of the POL cluster report has incorrect
> > > > > values for load percentages and number of CPUs etc.
> > > > >
> > > > > Here is an excerpt from my gmetad.conf file showing the three data
> > > > > sources. The host names have been changed for security reasons.
> > > > >
> > > > > data_source "POL's gmond" 65 pol.host.name:8649
> > > > > data_source "ESSC's gmond" 60 essc.host.name:8649
> > > > > data_source "BAS's gmond through SSH tunnel" 70 localhost:8647
> > > > >
> > > > > Here is some more information I think may be relevant.
> > > > > -- The ESSC cluster is on the same subnet as my Web frontend server
> > > > > -- There are no problems with the ESSC and BAS cluster reports
> > > > > -- The XML data received from POL's gmond is correct
> > > > > -- My gmetad version is 3.0.3, but I get the same problem on my
> > > > > backup gmetad machine which still has version 2.5.7
> > > > > -- POL's gmond is version 3.0.3, but ESSC and BAS have gmond
> > > > > version 2.5.7
> > > > > -- Accessing POL's gmond through a different port via an SSH tunnel
> > > > > (i.e. localhost:8648 instead of pol.host.name:8649) makes no
> > > > > difference
> > > > > -- Changing the order of the data sources in gmetad.conf makes no
> > > > > difference
> > > > > -- Removing either the ESSC or the BAS data source makes no
> > > > > difference; the POL cluster report still gets mixed up with the
> > > > > other cluster, whichever one it is
> > > > > -- Deleting all the RRD files in /var/lib/ganglia/rrds/ and
> > > > > starting again makes no difference
> > > > > -- The grid report page has correct values for the POL cluster
> > > > >
> > > > > I could change the host names and IP addresses of the ESSC cluster
> > > > > nodes, but that wouldn't stop the POL cluster report getting
> > > > > confused with BAS nodes and changing those clusters is not an
> > > > > option. Is there any way to solve this problem without making the
> > > > > node names of all the clusters different? All suggestions would be
> > > > > gratefully received. I hope I haven't missed something obvious.
> > > > >
> > > > > -Dan Bretherton.
> > > >
> >
--
Mr. D.A. Bretherton
Environmental Systems Science Centre
Harry Pitt Building
3 Earley Gate
Reading University
Reading, RG6 6AL
UK

Tel. +44 118 378 7722
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general