On Fri, 2 Nov 2007, Dan Bretherton wrote:
> Dear Matthias,
>
> Thanks for taking the time to look at this problem.
>
> > What happens if you go with the standard tcp_accept_channel section?
>
> That's how it was set up before POL opened port 8649 and changed
> tcp_accept_channel for us. The POL cluster report page looked exactly the
> same as it does now. Before the POL gmond was directly accessible I used an
> SSH tunnel to deliver Ganglia data to my Web frontend server. The data
> source lines in my gmetad.conf looked like this:
>
> data_source "POL's gmond through SSH tunnel" 65 localhost:8648
> data_source "ESSC's gmond" 60 essc.host.name:8649
> data_source "BAS's gmond through SSH tunnel" 70 localhost:8647
>
> I asked POL to make the changes because I thought the problem was related to
> having more than one data source coming from localhost (the end points of the
> tunnels).
>
Ok.
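For the archive: endpoint ports like localhost:8648 in the data_source lines
above are typically created with plain OpenSSH local port forwarding. A
sketch (the user name and the BAS gateway host below are placeholders, not
the real ones):

```shell
# Keep a background tunnel from local port 8648 to port 8649
# (gmond's tcp_accept_channel) on the POL head node.
ssh -f -N -L 8648:localhost:8649 [email protected]

# Likewise for BAS on local port 8647.
ssh -f -N -L 8647:localhost:8649 [email protected]
```

gmetad then polls localhost:8648 and localhost:8647 as if they were the
remote gmonds.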
> > Some :-) The XML output from POL clearly shows the nodes from the other
> > clusters being part of POL.
>
> Actually I don't think it does. I realise I should have explained the
> situation more clearly at the beginning. Let me describe the three clusters
> in a bit more detail so it is easier to see what should and shouldn't be
> attributed to POL.
>
Ok, so the title of this posting is kinda misleading :-) In fact,
gmetad is not mixing up the nodes after all.
Hm, what's your problem again? :-)
Matthias
> POL cluster nodes:
> =================
> node001.beowulf.cluster
> node002.beowulf.cluster
> :
> node090.beowulf.cluster
> file01.beowulf.cluster
> file02.beowulf.cluster
> nemo.beowulf.cluster
> nemo2.beowulf.cluster
>
> BAS cluster nodes:
> =================
> node001.beowulf.cluster
> node002.beowulf.cluster
> :
> node032.beowulf.cluster
> bslhadesws1.beowulf.cluster
> bslhadesws2.beowulf.cluster
> bslhadesws3.beowulf.cluster
> bslhadesws4.beowulf.cluster
> bslhadesws5.beowulf.cluster
> quad001.beowulf.cluster
> quad002.beowulf.cluster
> quad003.beowulf.cluster
> quad004.beowulf.cluster
> master.beowulf.cluster
> db01.beowulf.cluster
>
> ESSC cluster nodes:
> =================
> node001.beowulf.cluster
> node002.beowulf.cluster
> :
> node016.beowulf.cluster
> node101.beowulf.cluster
> node102.beowulf.cluster
> node103.beowulf.cluster
> node104.beowulf.cluster
> master.beowulf.cluster
> storage.beowulf.cluster
>
> I checked the POL gmond XML data again today to verify that none of the other
> clusters' nodes were listed. I also checked the load_one measurements for
> every POL node against the correct values from POL's internal Ganglia
> Webfrontend. I found no evidence of incorrect nodes or load values in POL's
> gmond XML data.
>
> > the question arises: how do you separate the clusters into cluster-local
> > domains? In other words, you somehow need to ensure that only the nodes
> > from POL talk to the gmonds running on POL.
>
> I'm pretty sure that is the situation we have now.
>
> > Can all nodes talk to each other directly?
>
> No. I can't think of a way that could possibly happen and I haven't found any
> evidence for it in the gmond XML data. I have checked the gmond data from
> all three clusters for evidence of nodes being mixed up. All three clusters
> are behind their institutional firewalls and the BAS cluster data comes here
> via an SSH tunnel (as POL's did until recently). I should add that the
> POL cluster report page is correct if I remove the other data sources from
> gmetad.conf, which suggests to me that the problem is with my Web frontend
> rather than POL's gmond.
>
> > Maybe you want to consider
> > using different mcast IPs for the different clusters?
> >
> > In principle, I would first simplify the gmond.conf files and then play
> > with the mcast addresses. If that starts to work, then I would add the
> > access control.
>
> Thanks for the suggestions. Do you have any other ideas in the light of the
> above?
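To make the earlier mcast/ACL suggestion concrete, here is roughly what it
would look like in a 3.x gmond.conf. The multicast addresses and the gmetad
host IP are arbitrary examples, not values anyone here is actually using;
note also that the 2.5.7 gmonds at ESSC and BAS use the older
mcast_channel/mcast_port directives instead of channel sections:

```
/* POL's gmond.conf: give POL its own multicast group */
udp_send_channel {
  mcast_join = 239.2.11.71   /* ESSC could use .72, BAS .73 */
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
}

/* Access control: only answer XML queries from the gmetad host */
tcp_accept_channel {
  port = 8649
  acl {
    default = "deny"
    access {
      ip = 192.0.2.10        /* placeholder for the gmetad/frontend host */
      mask = 32
      action = "allow"
    }
  }
}
```

With each cluster on its own multicast group, a stray node cannot show up in
the wrong gmond's view even where host names collide.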
>
> Regards,
> -Dan.
>
> >
> >
> > Matthias
> >
> > > -Dan.
> > >
> > > On Wednesday 31 Oct 2007 19:09, Matthias Blankenhaus wrote:
> > > > Dan,
> > > >
> > > > could you post the relevant snippets from gmond.conf from your cluster
> > > > nodes?
> > > >
> > > > What is the XML output from gmond on the POL cluster?
> > > >
> > > > Thanx,
> > > > Matthias
> > > >
> > > > On Wed, 31 Oct 2007, Dan Bretherton wrote:
> > > > > Dear All,
> > > > >
> > > > > Here are some updates to the message I posted to the list yesterday:
> > > > >
> > > > > 1) The XML data from gmetad seems to be correct. I got this data
> > > > > from "telnet localhost 8651". I can't see any incorrect nodes listed
> > > > > under POL, so I now suspect the Web frontend rather than gmetad.
> > > > >
> > > > > 2) the data in /var/lib/ganglia/rrds seems to be correct. There are
> > > > > no incorrect nodes listed in the directory for the POL cluster. This
> > > > > also points the finger at the Web frontend.
> > > > >
> > > > > 3) I have tried out the latest versions of gmetad and the Web
> > > > > frontend (3.0.5) with the latest version of rrdtool (1.2.23) on
> > > > > another computer to make sure the problem is not being caused by a
> > > > > bug that has been fixed. I found that the same problem occurs with
> > > > > the latest versions so I have left the public server
> > > > > (http://www.resc.reading.ac.uk/ganglia/) on Ganglia version 3.0.3 and
> > > > > rrdtool version 1.2.15.
> > > > >
> > > > > 4) The BAS cluster nodes actually have different IP addresses to the
> > > > > POL nodes of the same name, so the IP addresses are not the cause of
> > > > > the BAS nodes being listed in the POL cluster report.
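A quick way to repeat checks (1) and (2) at any time; `nc` here is
interchangeable with the telnet command used above, and pol.host.name is
the placeholder name from the gmetad.conf excerpt:

```shell
# Aggregated XML as gmetad serves it (interactive port 8651)
nc localhost 8651 | grep 'CLUSTER NAME'

# Raw XML straight from POL's gmond, counting the hosts it claims
nc pol.host.name 8649 | grep -c '<HOST '

# Hosts gmetad has written RRDs for under the POL cluster
ls /var/lib/ganglia/rrds/"NEMO cluster @ POL"/
```

If the first two look right but the cluster page is still wrong, that
narrows it to the PHP frontend, as suspected.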
> > > > >
> > > > > Regards,
> > > > > -Dan.
> > > > >
> > > > > On Tuesday 30 Oct 2007, you wrote:
> > > > > > Dear All,
> > > > > >
> > > > > > This is the first time I have posted to the list, but I have made
> > > > > > good use of the archives on many occasions. Unfortunately I can't
> > > > > > find anything in the archives to help with my current problem.
> > > > > >
> > > > > > I am monitoring a grid consisting of clusters at three institutions
> > > > > > called POL, BAS and ESSC. The clusters are all from the same
> > > > > > supplier and use the same convention for slave node IP addresses
> > > > > > and host names. All the clusters are behind their own institutional
> > > > > > firewalls. My Ganglia Web frontend is at the following address:
> > > > > > http://www.resc.reading.ac.uk/ganglia/
> > > > > >
> > > > > > My problem is that the POL cluster report mixes up nodes from all
> > > > > > three clusters. The POL cluster is listed as "NEMO cluster @ POL"
> > > > > > on the grid report page of my Web frontend. There are three main
> > > > > > problems with the POL cluster report:
> > > > > > 1) Nodes at ESSC and BAS with names not found at POL usually show
> > > > > > up as blank spaces on the POL cluster page unless they are down, in
> > > > > > which case they are represented by the usual pink box
> > > > > > 2) The load level colouring (and hence the positioning on the page)
> > > > > > of nodes that have the same name as nodes in other clusters is
> > > > > > often governed by the other clusters
> > > > > > 3) The overview section of the POL cluster report has incorrect
> > > > > > values for load percentages and number of CPUs etc.
> > > > > >
> > > > > > Here is an excerpt from my gmetad.conf file showing the three data
> > > > > > sources. The host names have been changed for security reasons.
> > > > > >
> > > > > > data_source "POL's gmond" 65 pol.host.name:8649
> > > > > > data_source "ESSC's gmond" 60 essc.host.name:8649
> > > > > > data_source "BAS's gmond through SSH tunnel" 70 localhost:8647
> > > > > >
> > > > > > Here is some more information I think may be relevant.
> > > > > > -- The ESSC cluster is on the same subnet as my Web frontend server
> > > > > > -- There are no problems with the ESSC and BAS cluster reports
> > > > > > -- The XML data received from POL's gmond is correct
> > > > > > -- My gmetad version is 3.0.3, but I get the same problem on my
> > > > > > backup gmetad machine, which still has version 2.5.7
> > > > > > -- POL's gmond is version 3.0.3, but ESSC and BAS have gmond
> > > > > > version 2.5.7
> > > > > > -- Accessing POL's gmond through a different port via an SSH
> > > > > > tunnel (i.e. localhost:8648 instead of pol.host.name:8649) makes
> > > > > > no difference
> > > > > > -- Changing the order of the data sources in gmetad.conf makes no
> > > > > > difference
> > > > > > -- Removing either the ESSC or the BAS data source makes no
> > > > > > difference; the POL cluster report still gets mixed up with the
> > > > > > other cluster, whichever one it is
> > > > > > -- Deleting all the RRD files in /var/lib/ganglia/rrds/ and
> > > > > > starting again makes no difference
> > > > > > -- The grid report page has correct values for the POL cluster
> > > > > >
> > > > > > I could change the host names and IP addresses of the ESSC cluster
> > > > > > nodes, but that wouldn't stop the POL cluster report getting
> > > > > > confused with BAS nodes and changing those clusters is not an
> > > > > > option. Is there any way to solve this problem without making the
> > > > > > node names of all the clusters different? All suggestions would be
> > > > > > gratefully received. I hope I haven't missed something obvious.
> > > > > >
> > > > > > -Dan Bretherton.
> > > > >
> > > > > --
> > > > > Mr. D.A. Bretherton
> > > > > Environmental Systems Science Centre
> > > > > Harry Pitt Building
> > > > > 3 Earley Gate
> > > > > Reading University
> > > > > Reading, RG6 6AL
> > > > > UK
> > > > >
> > > > > Tel. +44 118 378 7722
> > > > >
> > > > > _______________________________________________
> > > > > Ganglia-general mailing list
> > > > > [email protected]
> > > > > https://lists.sourceforge.net/lists/listinfo/ganglia-general
> > >
>
>