On Fri, 2 Nov 2007, Dan Bretherton wrote:
> Dear Matthias,
>
> Thanks for taking the time to look at this problem.
>
> > What happens if you go with the standard tcp_accept_channel section?
>
> That's how it was set up before POL opened port 8649 and changed
> tcp_accept_channel for us. The POL cluster report page looked exactly the
> same as it does now. Before the POL gmond was directly accessible I used an
> SSH tunnel to deliver Ganglia data to my Web frontend server. The data
> source lines in my gmetad.conf looked like this:
>
> data_source "POL's gmond through SSH tunnel" 65 localhost:8648
> data_source "ESSC's gmond" 60 essc.host.name:8649
> data_source "BAS's gmond through SSH tunnel" 70 localhost:8647
>
> I asked POL to make the changes because I thought the problem was related to
> having more than one data source coming from localhost (the end points of the
> tunnels).
>
Ok.
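For the archive: endpoint ports like localhost:8648 in the data_source lines
above are typically created with plain OpenSSH local port forwarding. A
sketch (the user name and the BAS gateway host below are placeholders, not
the real ones):

```shell
# Keep a background tunnel from local port 8648 to port 8649
# (gmond's tcp_accept_channel) on the POL head node.
ssh -f -N -L 8648:localhost:8649 [email protected]

# Likewise for BAS on local port 8647.
ssh -f -N -L 8647:localhost:8649 [email protected]
```

gmetad then polls localhost:8648 and localhost:8647 as if they were the
remote gmonds.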
> > Some :-) The XML output from POL clearly shows the nodes from the other
> > clusters being part of POL.
>
> Actually I don't think it does. I realise I should have explained the
> situation more clearly at the beginning. Let me describe the three clusters
> in a bit more detail so it is easier to see what should and shouldn't be
> attributed to POL.
>
Ok, so the title of this posting is kinda misleading :-) In fact,
gmetad is not mixing up the nodes after all.
Hm, what's your problem again? :-)
Matthias
> POL cluster nodes:
> =================
> node001.beowulf.cluster
> node002.beowulf.cluster
> :
> node090.beowulf.cluster
> file01.beowulf.cluster
> file02.beowulf.cluster
> nemo.beowulf.cluster
> nemo2.beowulf.cluster
>
> BAS cluster nodes:
> =================
> node001.beowulf.cluster
> node002.beowulf.cluster
> :
> node032.beowulf.cluster
> bslhadesws1.beowulf.cluster
> bslhadesws2.beowulf.cluster
> bslhadesws3.beowulf.cluster
> bslhadesws4.beowulf.cluster
> bslhadesws5.beowulf.cluster
> quad001.beowulf.cluster
> quad002.beowulf.cluster
> quad003.beowulf.cluster
> quad004.beowulf.cluster
> master.beowulf.cluster
> db01.beowulf.cluster
>
> ESSC cluster nodes:
> =================
> node001.beowulf.cluster
> node002.beowulf.cluster
> :
> node016.beowulf.cluster
> node101.beowulf.cluster
> node102.beowulf.cluster
> node103.beowulf.cluster
> node104.beowulf.cluster
> master.beowulf.cluster
> storage.beowulf.cluster
>
> I checked the POL gmond XML data again today to verify that none of the other
> clusters' nodes were listed. I also checked the load_one measurements for
> every POL node against the correct values from POL's internal Ganglia
> Webfrontend. I found no evidence of incorrect nodes or load values in POL's
> gmond XML data.
>
> > the question arises: how do you separate the clusters into cluster-local
> > domains? In other words, you somehow need to ensure that only the nodes
> > from POL talk to the gmonds running on POL.
>
> I'm pretty sure that is the situation we have now.
>
> > Can all nodes talk to each other directly?
>
> No. I can't think of a way that could possibly happen and I haven't found any
> evidence for it in the gmond XML data. I have checked the gmond data from
> all three clusters for evidence of nodes being mixed up. All three clusters
> are behind their institutional firewalls and the BAS cluster data comes here
> via an SSH tunnel (as POL's did until recently). I should add that the
> POL cluster report page is correct if I remove the other data sources from
> gmetad.conf, which suggests to me that the problem is with my Web frontend
> rather than POL's gmond.
>
> > Maybe you want to consider
> > using different mcast IPs for the different clusters?
> >
> > In principle, I would first simplify the gmond.conf files and then play
> > with the mcast addresses. If that starts to work, then I would add the
> > access control.
>
> Thanks for the suggestions. Do you have any other ideas in the light of the
> above?
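To make the earlier mcast/ACL suggestion concrete, here is roughly what it
would look like in a 3.x gmond.conf. The multicast addresses and the gmetad
host IP are arbitrary examples, not values anyone here is actually using;
note also that the 2.5.7 gmonds at ESSC and BAS use the older
mcast_channel/mcast_port directives instead of channel sections:

```
/* POL's gmond.conf: give POL its own multicast group */
udp_send_channel {
  mcast_join = 239.2.11.71   /* ESSC could use .72, BAS .73 */
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
}

/* Access control: only answer XML queries from the gmetad host */
tcp_accept_channel {
  port = 8649
  acl {
    default = "deny"
    access {
      ip = 192.0.2.10        /* placeholder for the gmetad/frontend host */
      mask = 32
      action = "allow"
    }
  }
}
```

With each cluster on its own multicast group, a stray node cannot show up in
the wrong gmond's view even where host names collide.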
>
> Regards,
> -Dan.
>
> >
> >
> > Matthias
> >
> > > -Dan.
> > >
> > > On Wednesday 31 Oct 2007 19:09, Matthias Blankenhaus wrote:
> > > > Dan,
> > > >
> > > > could you post the relevant snippets from gmond.conf from your cluster
> > > > nodes?
> > > >
> > > > What is the XML output from gmond on the POL cluster?
> > > >
> > > > Thanx,
> > > > Matthias
> > > >
> > > > On Wed, 31 Oct 2007, Dan Bretherton wrote:
> > > > > Dear All,
> > > > >
> > > > > Here are some updates to the message I posted to the list yesterday:
> > > > >
> > > > > 1) The XML data from gmetad seems to be correct. I got this data
> > > > > from "telnet localhost 8651". I can't see any incorrect nodes listed
> > > > > under POL, so I now suspect the Web frontend rather than gmetad.
> > > > >
> > > > > 2) the data in /var/lib/ganglia/rrds seems to be correct. There are
> > > > > no incorrect nodes listed in the directory for the POL cluster. This
> > > > > also points the finger at the Web frontend.
> > > > >
> > > > > 3) I have tried out the latest versions of gmetad and the Web
> > > > > frontend (3.0.5) with the latest version of rrdtool (1.2.23) on
> > > > > another computer to make sure the problem is not being caused by a
> > > > > bug that has been fixed. I found that the same problem occurs with
> > > > > the latest versions so I have left the public server
> > > > > (http://www.resc.reading.ac.uk/ganglia/) on Ganglia version 3.0.3 and
> > > > > rrdtool version 1.2.15.
> > > > >
> > > > > 4) The BAS cluster nodes actually have different IP addresses to the
> > > > > POL nodes of the same name, so the IP addresses are not the cause of
> > > > > the BAS nodes being listed in the POL cluster report.
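A quick way to repeat checks (1) and (2) at any time; `nc` here is
interchangeable with the telnet command used above, and pol.host.name is
the placeholder name from the gmetad.conf excerpt:

```shell
# Aggregated XML as gmetad serves it (interactive port 8651)
nc localhost 8651 | grep 'CLUSTER NAME'

# Raw XML straight from POL's gmond, counting the hosts it claims
nc pol.host.name 8649 | grep -c '<HOST '

# Hosts gmetad has written RRDs for under the POL cluster
ls /var/lib/ganglia/rrds/"NEMO cluster @ POL"/
```

If the first two look right but the cluster page is still wrong, that
narrows it to the PHP frontend, as suspected.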
> > > > >
> > > > > Regards,
> > > > > -Dan.
> > > > >
> > > > > On Tuesday 30 Oct 2007, you wrote:
> > > > > > Dear All,
> > > > > >
> > > > > > This is the first time I have posted to the list, but I have made
> > > > > > good use of the archives on many occasions. Unfortunately I can't
> > > > > > find anything in the archives to help with my current problem.
> > > > > >
> > > > > > I am monitoring a grid consisting of clusters at three institutions
> > > > > > called POL, BAS and ESSC. The clusters are all from the same
> > > > > > supplier and use the same convention for slave node IP addresses
> > > > > > and host names. All the clusters are behind their own institutional
> > > > > > firewalls. My Ganglia Web frontend is at the following address:
> > > > > > http://www.resc.reading.ac.uk/ganglia/
> > > > > >
> > > > > > My problem is that the POL cluster report mixes up nodes from all
> > > > > > three clusters. The POL cluster is listed as "NEMO cluster @ POL"
> > > > > > on the grid report page of my Web frontend. There are three main
> > > > > > problems with the POL cluster report:
> > > > > > 1) Nodes at ESSC and BAS with names not found at POL usually show
> > > > > > up as blank spaces on the POL cluster page unless they are down, in
> > > > > > which case they are represented by the usual pink box
> > > > > > 2) The load level colouring (and hence the positioning on the page)
> > > > > > of nodes that have the same name as nodes in other clusters is
> > > > > > often governed by the other clusters
> > > > > > 3) The overview section of the POL cluster report has incorrect
> > > > > > values for load percentages and number of CPUs etc.
> > > > > >
> > > > > > Here is an excerpt from my gmetad.conf file showing the three data
> > > > > > sources. The host names have been changed for security reasons.
> > > > > >
> > > > > > data_source "POL's gmond" 65 pol.host.name:8649
> > > > > > data_source "ESSC's gmond" 60 essc.host.name:8649
> > > > > > data_source "BAS's gmond through SSH tunnel" 70 localhost:8647
> > > > > >
> > > > > > Here is some more information I think may be relevant.
> > > > > > -- The ESSC cluster is on the same subnet as my Web frontend server
> > > > > > -- There are no problems with the ESSC and BAS cluster reports
> > > > > > -- The XML data received from POL's gmond is correct
> > > > > > -- My gmetad version is 3.0.3, but I get the same problem on my
> > > > > > backup gmetad machine, which still has version 2.5.7
> > > > > > -- POL's gmond is version 3.0.3, but ESSC and BAS have gmond
> > > > > > version 2.5.7
> > > > > > -- Accessing POL's gmond through a different port via an SSH
> > > > > > tunnel (i.e. localhost:8648 instead of pol.host.name:8649) makes
> > > > > > no difference
> > > > > > -- Changing the order of the data sources in gmetad.conf makes no
> > > > > > difference
> > > > > > -- Removing either the ESSC or the BAS data source makes no
> > > > > > difference; the POL cluster report still gets mixed up with the
> > > > > > other cluster, whichever one it is
> > > > > > -- Deleting all the RRD files in /var/lib/ganglia/rrds/ and
> > > > > > starting again makes no difference
> > > > > > -- The grid report page has correct values for the POL cluster
> > > > > >
> > > > > > I could change the host names and IP addresses of the ESSC cluster
> > > > > > nodes, but that wouldn't stop the POL cluster report getting
> > > > > > confused with BAS nodes and changing those clusters is not an
> > > > > > option. Is there any way to solve this problem without making the
> > > > > > node names of all the clusters different? All suggestions would be
> > > > > > gratefully received. I hope I haven't missed something obvious.
> > > > > >
> > > > > > -Dan Bretherton.
> > > > >
> > > > > --
> > > > > Mr. D.A. Bretherton
> > > > > Environmental Systems Science Centre
> > > > > Harry Pitt Building
> > > > > 3 Earley Gate
> > > > > Reading University
> > > > > Reading, RG6 6AL
> > > > > UK
> > > > >
> > > > > Tel. +44 118 378 7722
> > > > >
> > > > > _______________________________________________
> > > > > Ganglia-general mailing list
> > > > > [email protected]
> > > > > https://lists.sourceforge.net/lists/listinfo/ganglia-general
> > >
>
>