On Monday 05 Nov 2007 19:52, Matthias Blankenhaus wrote:
> > > Ok, so then the title of this posting is kinda misleading :-) In
> > > fact, gmetad is then not mixing up the nodes.
> >
> > I actually have the same setup :) And I am well aware of the
> > difference between these two daemons.
Sorry, I thought you were confusing gmond with gmetad. I was just trying
to be helpful.

> Ok. I looked at your web link, but again still did not see nodes being
> mixed up. I went to your three clusters and saw the nodes listed
> according to your gmond.conf files.

I have attached a screenshot of the POL cluster report taken just now.
The nodes coloured red, yellow and orange do not have anything running
on them. The nodes that do have jobs running are coloured blue, and
appear in various positions below where the screenshot ends. The two
nodes shown as down are not actually down; the nodes that really are
down belong to a different cluster. The blank spaces are nodes in the
other clusters whose names do not match any of the POL nodes; you can
tell by looking at the URL shown at the bottom of the browser (Firefox
in my case) when the mouse pointer is moved over the blank spaces.

> If indeed the XML output from gmetad looks fine, e.g. the cluster
> nodes are listed according to their cluster, and you experience a
> mix-up in the GUI (again, I was not able to see that), then yes, it
> can only be something with the presentation layer, or your RRD records
> are incorrect.

Is this a bug, and if so should I report it to the developers?

-Dan.

> Cheers,
> Matthias
>
> > Regards,
> > -Dan Bretherton.
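For what it's worth, the kind of check I mean when I talk about looking at the XML can be scripted. This is only a rough sketch: the XML excerpt below is made up for illustration, whereas the real data would come from gmetad itself (e.g. "telnet localhost 8651" or "nc localhost 8651"), and the cluster and host names are placeholders.

```shell
# A tiny hand-written excerpt standing in for the real gmetad XML,
# which would normally be captured from port 8651.
cat > /tmp/gmetad.xml <<'EOF'
<GRID NAME="NERC-ClusterGrid">
 <CLUSTER NAME="NEMO cluster @ POL">
  <HOST NAME="node001.beowulf.cluster" IP="192.168.1.1"/>
 </CLUSTER>
 <CLUSTER NAME="BAS">
  <HOST NAME="quad001.beowulf.cluster" IP="192.168.2.1"/>
 </CLUSTER>
</GRID>
EOF

# Print only the lines between the POL <CLUSTER> tag and its closing
# </CLUSTER>, then pull out the HOST NAME attributes.  Any host that
# shows up here but belongs to another cluster would implicate gmetad;
# if this list is clean but the page is wrong, it is the web frontend.
sed -n '/<CLUSTER NAME="NEMO cluster @ POL"/,/<\/CLUSTER>/p' /tmp/gmetad.xml \
  | grep -o 'HOST NAME="[^"]*"'
# -> HOST NAME="node001.beowulf.cluster"
```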
> > > > > Matthias
>
> > > > POL cluster nodes:
> > > > =================
> > > > node001.beowulf.cluster
> > > > node002.beowulf.cluster
> > > >
> > > > node090.beowulf.cluster
> > > > file01.beowulf.cluster
> > > > file02.beowulf.cluster
> > > > nemo.beowulf.cluster
> > > > nemo2.beowulf.cluster
> > > >
> > > > BAS cluster nodes:
> > > > =================
> > > > node001.beowulf.cluster
> > > > node002.beowulf.cluster
> > > >
> > > > node032.beowulf.cluster
> > > > bslhadesws1.beowulf.cluster
> > > > bslhadesws2.beowulf.cluster
> > > > bslhadesws3.beowulf.cluster
> > > > bslhadesws4.beowulf.cluster
> > > > bslhadesws5.beowulf.cluster
> > > > quad001.beowulf.cluster
> > > > quad002.beowulf.cluster
> > > > quad003.beowulf.cluster
> > > > quad004.beowulf.cluster
> > > > master.beowulf.cluster
> > > > db01.beowulf.cluster
> > > >
> > > > ESSC cluster nodes:
> > > > =================
> > > > node001.beowulf.cluster
> > > > node002.beowulf.cluster
> > > >
> > > > node016.beowulf.cluster
> > > > node101.beowulf.cluster
> > > > node102.beowulf.cluster
> > > > node103.beowulf.cluster
> > > > node104.beowulf.cluster
> > > > master.beowulf.cluster
> > > > storage.beowulf.cluster
> > > >
> > > > I checked the POL gmond XML data again today to verify that none
> > > > of the other clusters' nodes were listed. I also checked the
> > > > load_one measurements for every POL node against the correct
> > > > values from POL's internal Ganglia web frontend. I found no
> > > > evidence of incorrect nodes or load values in POL's gmond XML
> > > > data.
>
> > > > > the question arises how do you separate the cluster into
> > > > > cluster-local domains ? In other words, you somehow need to
> > > > > ensure that only the nodes from POL talk to the gmonds running
> > > > > on POL.
>
> > > > I'm pretty sure that is the situation we have now.
>
> > > > > Can all nodes talk to each other directly ?
>
> > > > No.
> > > > I can't think of a way that could possibly happen, and I haven't
> > > > found any evidence for it in the gmond XML data. I have checked
> > > > the gmond data from all three clusters for evidence of nodes
> > > > being mixed up. All three clusters are behind their institutional
> > > > firewalls, and the BAS cluster data comes here via an SSH tunnel
> > > > (as POL's did too until recently). I should add that the POL
> > > > cluster report page is correct if I remove the other data sources
> > > > from gmetad.conf, which suggests to me that the problem is with
> > > > my web frontend rather than POL's gmond.
>
> > > > > Maybe you want to consider using different mcast IPs for the
> > > > > different clusters ?
> > > > >
> > > > > In principle, I would first simplify the gmond.conf files and
> > > > > then play with the mcast addresses. If that starts to work,
> > > > > then I would add the access control.
>
> > > > Thanks for the suggestions. Do you have any other ideas in the
> > > > light of the above?
> > > >
> > > > Regards,
> > > > -Dan.
>
> > > > > Matthias
>
> > > > > > -Dan.
> > > > > >
> > > > > > On Wednesday 31 Oct 2007 19:09, Matthias Blankenhaus wrote:
> > > > > > > Dan,
> > > > > > >
> > > > > > > could you post the relevant snippets from gmond.conf from
> > > > > > > your cluster nodes ?
> > > > > > >
> > > > > > > What is the XML output from gmond on the POL cluster ?
> > > > > > >
> > > > > > > Thanx,
> > > > > > > Matthias
> > > > > > >
> > > > > > > On Wed, 31 Oct 2007, Dan Bretherton wrote:
> > > > > > > > Dear All,
> > > > > > > >
> > > > > > > > Here are some updates to the message I posted to the list
> > > > > > > > yesterday:
> > > > > > > >
> > > > > > > > 1) The XML data from gmetad seems to be correct. I got
> > > > > > > > this data from "telnet localhost 8651". I can't see any
> > > > > > > > incorrect nodes listed under POL, so I now suspect the
> > > > > > > > web frontend rather than gmetad.
> > > > > > > > 2) The data in /var/lib/ganglia/rrds seems to be
> > > > > > > > correct. There are no incorrect nodes listed in the
> > > > > > > > directory for the POL cluster. This also points the
> > > > > > > > finger at the web frontend.
> > > > > > > >
> > > > > > > > 3) I have tried out the latest versions of gmetad and the
> > > > > > > > web frontend (3.0.5) with the latest version of rrdtool
> > > > > > > > (1.2.23) on another computer, to make sure the problem is
> > > > > > > > not being caused by a bug that has already been fixed.
> > > > > > > > The same problem occurs with the latest versions, so I
> > > > > > > > have left the public server
> > > > > > > > (http://www.resc.reading.ac.uk/ganglia/) on Ganglia
> > > > > > > > version 3.0.3 and rrdtool version 1.2.15.
> > > > > > > >
> > > > > > > > 4) The BAS cluster nodes actually have different IP
> > > > > > > > addresses from the POL nodes of the same name, so the IP
> > > > > > > > addresses are not the cause of the BAS nodes being listed
> > > > > > > > in the POL cluster report.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > -Dan.
> > > > > > > >
> > > > > > > > On Tuesday 30 Oct 2007, you wrote:
> > > > > > > > > Dear All,
> > > > > > > > >
> > > > > > > > > This is the first time I have posted to the list, but I
> > > > > > > > > have made good use of the archives on many occasions.
> > > > > > > > > Unfortunately I can't find anything in the archives to
> > > > > > > > > help with my current problem.
> > > > > > > > >
> > > > > > > > > I am monitoring a grid consisting of clusters at three
> > > > > > > > > institutions, called POL, BAS and ESSC. The clusters
> > > > > > > > > are all from the same supplier and use the same
> > > > > > > > > convention for slave node IP addresses and host names.
> > > > > > > > > All the clusters are behind their own institutional
> > > > > > > > > firewalls.
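Because the clusters follow the same naming convention, the set of host names that exist in more than one cluster (the ones at risk of being conflated in a report) can be listed directly. A minimal sketch; the two node lists below are made-up excerpts, not the full lists:

```shell
# Made-up excerpts of two of the node lists; the real lists are much
# longer, but the overlap check is the same.
cat > /tmp/pol_nodes.txt <<'EOF'
node001.beowulf.cluster
node002.beowulf.cluster
nemo.beowulf.cluster
EOF
cat > /tmp/bas_nodes.txt <<'EOF'
node001.beowulf.cluster
node002.beowulf.cluster
quad001.beowulf.cluster
EOF

# comm(1) requires sorted input; comm -12 then prints only the lines
# common to both files, i.e. the host names shared by both clusters.
sort /tmp/pol_nodes.txt > /tmp/pol_sorted.txt
sort /tmp/bas_nodes.txt > /tmp/bas_sorted.txt
comm -12 /tmp/pol_sorted.txt /tmp/bas_sorted.txt
# -> node001.beowulf.cluster
# -> node002.beowulf.cluster
```

Any name printed here is one the frontend could plausibly file under the wrong cluster; names unique to one cluster should never be affected.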
> > > > > > > > > My Ganglia web frontend is at the following address:
> > > > > > > > > http://www.resc.reading.ac.uk/ganglia/
> > > > > > > > >
> > > > > > > > > My problem is that the POL cluster report mixes up
> > > > > > > > > nodes from all three clusters. The POL cluster is
> > > > > > > > > listed as "NEMO cluster @ POL" on the grid report page
> > > > > > > > > of my web frontend. There are three main problems with
> > > > > > > > > the POL cluster report:
> > > > > > > > >
> > > > > > > > > 1) Nodes at ESSC and BAS with names not found at POL
> > > > > > > > > usually show up as blank spaces on the POL cluster
> > > > > > > > > page, unless they are down, in which case they are
> > > > > > > > > represented by the usual pink box.
> > > > > > > > > 2) The load-level colouring (and hence the positioning
> > > > > > > > > on the page) of nodes that have the same name as nodes
> > > > > > > > > in other clusters is often governed by the other
> > > > > > > > > clusters.
> > > > > > > > > 3) The overview section of the POL cluster report has
> > > > > > > > > incorrect values for load percentages, number of CPUs,
> > > > > > > > > etc.
> > > > > > > > >
> > > > > > > > > Here is an excerpt from my gmetad.conf file showing the
> > > > > > > > > three data sources. The host names have been changed
> > > > > > > > > for security reasons.
> > > > > > > > >
> > > > > > > > > data_source "POL's gmond" 65 pol.host.name:8649
> > > > > > > > > data_source "ESSC's gmond" 60 essc.host.name:8649
> > > > > > > > > data_source "BAS's gmond through SSH tunnel" 70 localhost:8647
> > > > > > > > >
> > > > > > > > > Here is some more information I think may be relevant.
> > > > > > > > > -- The ESSC cluster is on the same subnet as my web
> > > > > > > > > frontend server.
> > > > > > > > > -- There are no problems with the ESSC and BAS cluster
> > > > > > > > > reports.
> > > > > > > > > -- The XML data received from POL's gmond is correct.
> > > > > > > > > -- My gmetad version is 3.0.3, but I get the same
> > > > > > > > > problem on my backup gmetad machine, which still has
> > > > > > > > > version 2.5.7.
> > > > > > > > > -- POL's gmond is version 3.0.3, but ESSC and BAS have
> > > > > > > > > gmond version 2.5.7.
> > > > > > > > > -- Accessing POL's gmond through a different port via
> > > > > > > > > an SSH tunnel (i.e. localhost:8648 instead of
> > > > > > > > > pol.host.name:8649) makes no difference.
> > > > > > > > > -- Changing the order of the data sources in
> > > > > > > > > gmetad.conf makes no difference.
> > > > > > > > > -- Removing either the ESSC or the BAS data source
> > > > > > > > > makes no difference; the POL cluster report still gets
> > > > > > > > > mixed up with the other cluster, whichever one it is.
> > > > > > > > > -- Deleting all the RRD files in /var/lib/ganglia/rrds/
> > > > > > > > > and starting again makes no difference.
> > > > > > > > > -- The grid report page has correct values for the POL
> > > > > > > > > cluster.
> > > > > > > > >
> > > > > > > > > I could change the host names and IP addresses of the
> > > > > > > > > ESSC cluster nodes, but that wouldn't stop the POL
> > > > > > > > > cluster report getting confused with BAS nodes, and
> > > > > > > > > changing those clusters is not an option. Is there any
> > > > > > > > > way to solve this problem without making the node names
> > > > > > > > > of all the clusters different? All suggestions would be
> > > > > > > > > gratefully received. I hope I haven't missed something
> > > > > > > > > obvious.
> > > > > > > > >
> > > > > > > > > -Dan Bretherton.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Mr. D.A. Bretherton
> > > > > > > > Environmental Systems Science Centre
> > > > > > > > Harry Pitt Building
> > > > > > > > 3 Earley Gate
> > > > > > > > Reading University
> > > > > > > > Reading, RG6 6AL
> > > > > > > > UK
> > > > > > > >
> > > > > > > > Tel. +44 118 378 7722

--
Mr. D.A. Bretherton
Environmental Systems Science Centre
Harry Pitt Building
3 Earley Gate
Reading University
Reading, RG6 6AL
UK

Tel. +44 118 378 7722
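[For reference, the separate-multicast-group arrangement suggested earlier in the thread would look roughly like this in a 3.0.x-style gmond.conf. This is only a sketch: the multicast addresses and the cluster name are illustrative placeholders, not the values actually in use on these clusters.]

```
/* POL nodes -- gmond.conf sketch (3.0.x syntax).  Each cluster's
   gmonds join their own multicast group, so that hosts with identical
   names in different clusters never share a metric channel. */
cluster {
  name = "NEMO cluster @ POL"
}
udp_send_channel {
  mcast_join = 239.2.11.71   /* e.g. use 239.2.11.72 for BAS,
                                239.2.11.73 for ESSC */
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  bind = 239.2.11.71
  port = 8649
}
tcp_accept_channel {
  port = 8649
}
```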
<<attachment: NERC-ClusterGrid_POL1.jpg>>
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

