On Mon, 5 Nov 2007, Dan Bretherton wrote:

> Dear Matthias,
> >
> > Ok, so then the title of this posting is kinda misleading :-)  In fact,
> > gmetad is then not mixing up the nodes.
> 
> I think I see where you are getting confused.  There are two different types 
> of daemon involved in Ganglia - gmond and gmetad.  Gmond monitors the nodes 
> in a single cluster.  The XML data I posted was from gmond.  The other 
> daemon, gmetad, brings together information about the different clusters from 
> several different gmond data sources (in the case of a simple grid like 
> mine).  The gmetad daemon provides information to the Web frontend, which 
> shows information about the grid as a whole, and (if it is working properly) 
> each different cluster in the grid.  I have one gmetad in the grid and it 
> runs on the Web frontend server.

I actually have the same setup :)  And I am well aware of the difference 
between these two daemons.
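
For anyone reading this in the archive, the wiring looks roughly like this 
(hypothetical host names, gmond 3.x config syntax):

```
# gmond.conf on each node of ONE cluster (gmond 3.x syntax);
# every node in the cluster shares the same cluster name:
cluster {
  name = "NEMO cluster @ POL"
}

# gmetad.conf on the web frontend: one data_source line per cluster,
# each pointing at a reachable gmond (or SSH tunnel endpoint) of
# that cluster:
data_source "POL's gmond" pol.host.name:8649
data_source "ESSC's gmond" essc.host.name:8649
```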

> 
> >
> > Mh, what's your problem again ? :-)
> >
> In my situation gmond is OK.  The XML data I posted was from gmond running on 
> one of the clusters in the grid.  The nodes get mixed up somewhere during the 
> process of bringing together data from the separate gmonds on the three 
> clusters and arranging it in the form of web pages.  I thought at first that 
> my gmetad was mixing up the nodes from different clusters because gmetad.conf 
> is where the location of the gmond data sources are specified.
> 
> You might have missed my second posting 
> (http://www.mail-archive.com/[email protected]/msg03325.html),
>  
> after someone told me how to get at the data produced by gmetad.  Gmetad XML 
> data is different to gmond XML data; it includes information about all the 
> clusters in the grid, not just one cluster.  I didn't post any gmetad output 
> but I did find that the XML data produced by my gmetad appears to be correct. 
>  
> I presume that means the mix up occurs in the Web frontend itself.
> 

Ok.  I looked at your web link, but again I still did not see nodes being 
mixed up.  I went to your three clusters and saw the nodes listed according 
to your gmond.conf files.  If the XML output from gmetad is indeed fine, 
i.e. the nodes are listed under their correct clusters, and you see a 
mix-up in the GUI (which I was not able to reproduce), then yes, it can 
only be something in the presentation layer, or your RRD records are 
incorrect.
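
In case it helps, here is a quick sketch for double-checking the gmetad XML 
(the port 8651 output) for hosts showing up under the wrong cluster.  It 
assumes the standard <GANGLIA_XML><GRID><CLUSTER><HOST/> layout; the names 
in the sample below are made up, and any DOCTYPE header in the real output 
may need stripping first:

```python
# Sketch: group host names by cluster from gmetad's XML output
# (e.g. captured via "telnet localhost 8651").  Assumes the standard
# <GANGLIA_XML><GRID><CLUSTER><HOST/> structure; sample names are made up.
import xml.etree.ElementTree as ET

def hosts_by_cluster(xml_text):
    """Return a dict mapping cluster name -> sorted list of host names."""
    root = ET.fromstring(xml_text)
    return {c.get("NAME"): sorted(h.get("NAME") for h in c.iter("HOST"))
            for c in root.iter("CLUSTER")}

sample = """<GANGLIA_XML VERSION="3.0.3" SOURCE="gmetad">
 <GRID NAME="Example Grid">
  <CLUSTER NAME="NEMO cluster @ POL" OWNER="" LATLONG="" URL="" LOCALTIME="0">
   <HOST NAME="node001.beowulf.cluster" IP="192.168.1.1" REPORTED="0"/>
   <HOST NAME="nemo.beowulf.cluster" IP="192.168.1.2" REPORTED="0"/>
  </CLUSTER>
  <CLUSTER NAME="BAS" OWNER="" LATLONG="" URL="" LOCALTIME="0">
   <HOST NAME="node001.beowulf.cluster" IP="10.0.0.1" REPORTED="0"/>
  </CLUSTER>
 </GRID>
</GANGLIA_XML>"""

# Print each cluster with its hosts; duplicate host names across clusters
# are expected here (that is your situation), but each host should only
# appear under the cluster whose gmond reported it.
for cluster, hosts in sorted(hosts_by_cluster(sample).items()):
    print(cluster, hosts)
```

If the per-cluster lists from your real gmetad output match the gmond.conf 
cluster memberships, the data layer is fine and the web frontend (or the 
RRDs it reads) is the place to look.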


Cheers,
Matthias

> Regards,
> -Dan Bretherton.
> 
> >
> >
> > Matthias
> >
> > > POL cluster nodes:
> > > =================
> > > node001.beowulf.cluster
> > > node002.beowulf.cluster
> > >
> > > node090.beowulf.cluster
> > > file01.beowulf.cluster
> > > file02.beowulf.cluster
> > > nemo.beowulf.cluster
> > > nemo2.beowulf.cluster
> > >
> > > BAS cluster nodes:
> > > =================
> > > node001.beowulf.cluster
> > > node002.beowulf.cluster
> > >
> > > node032.beowulf.cluster
> > > bslhadesws1.beowulf.cluster
> > > bslhadesws2.beowulf.cluster
> > > bslhadesws3.beowulf.cluster
> > > bslhadesws4.beowulf.cluster
> > > bslhadesws5.beowulf.cluster
> > > quad001.beowulf.cluster
> > > quad002.beowulf.cluster
> > > quad003.beowulf.cluster
> > > quad004.beowulf.cluster
> > > master.beowulf.cluster
> > > db01.beowulf.cluster
> > >
> > > ESSC cluster nodes:
> > > =================
> > > node001.beowulf.cluster
> > > node002.beowulf.cluster
> > >
> > > node016.beowulf.cluster
> > > node101.beowulf.cluster
> > > node102.beowulf.cluster
> > > node103.beowulf.cluster
> > > node104.beowulf.cluster
> > > master.beowulf.cluster
> > > storage.beowulf.cluster
> > >
> > > I checked the POL gmond XML data again today to verify that none of the
> > > other clusters' nodes were listed.  I also checked the load_one
> > > measurements for every POL node against the correct values from POL's
> > > internal Ganglia Webfrontend.  I found no evidence of incorrect nodes or
> > > load values in POL's gmond XML data.
> > >
> > > > the question arises how do you separate the clusters into cluster-local
> > > > domains?  In other words, you somehow need to ensure that only the
> > > > nodes from POL talk to the gmonds running on POL.
> > >
> > > I'm pretty sure that is the situation we have now.
> > >
> > > > Can all nodes talk to each other directly ?
> > >
> > > No. I can't think of a way that could possibly happen and I haven't found
> > > any evidence for it in the gmond XML data.  I have checked the gmond data
> > > from all three clusters for evidence of nodes being mixed up.  All three
> > > clusters are behind their institutional firewalls and the BAS cluster
> > > data comes here via a SSH tunnel (as POL's was too until recently).  I
> > > should add that the POL cluster report page is correct if I remove the
> > > other data sources from gmetad.conf, which suggests to me that the
> > > problem is with my Web frontend rather than POL's gmond.
> > >
> > > > Maybe you want to consider
> > > > using different mcast IPs for the different clusters?
> > > >
> > > > In principle, I would first simplify the gmond.conf files and then play
> > > > with the mcast addresses.  If that starts to work, then I would add the
> > > > access control.
> > >
> > > Thanks for the suggestions.  Do you have any other ideas in the light of
> > > the above?
> > >
> > > Regards,
> > > -Dan.
> > >
> > > > Matthias
> > > >
> > > > > -Dan.
> > > > >
> > > > > On Wednesday 31 Oct 2007 19:09, Matthias Blankenhaus wrote:
> > > > > > Dan,
> > > > > >
> > > > > > could you post the relevant snippets from gmond.conf from your
> > > > > > cluster nodes ?
> > > > > >
> > > > > > What is the XML output from gmond on the POL cluster ?
> > > > > >
> > > > > > Thanx,
> > > > > > Matthias
> > > > > >
> > > > > > On Wed, 31 Oct 2007, Dan Bretherton wrote:
> > > > > > > Dear All,
> > > > > > >
> > > > > > > Here are some updates to the message I posted to the list
> > > > > > > yesterday:
> > > > > > >
> > > > > > > 1) The XML data from gmetad seems to be correct.  I got this data
> > > > > > > from "telnet localhost 8651".  I can't see any incorrect nodes
> > > > > > > listed under POL, so I now suspect the Web frontend rather than
> > > > > > > gmetad.
> > > > > > >
> > > > > > > 2) the data in /var/lib/ganglia/rrds seems to be correct.  There
> > > > > > > are no incorrect nodes listed in the directory for the POL
> > > > > > > cluster.  This also points the finger at the Web frontend.
> > > > > > >
> > > > > > > 3) I have tried out the latest versions of gmetad and the Web
> > > > > > > frontend (3.0.5) with the latest version of rrdtool (1.2.23) on
> > > > > > > another computer to make sure the problem is not being caused by
> > > > > > > a bug that has been fixed.  I found that the same problem occurs
> > > > > > > with the latest versions so I have left the public server
> > > > > > > (http://www.resc.reading.ac.uk/ganglia/) on ganglia version 3.0.3
> > > > > > > > and rrdtool version 1.2.15.
> > > > > > >
> > > > > > > 4) The BAS cluster nodes actually have different IP addresses to
> > > > > > > the POL nodes of the same name, so the IP addresses are not the
> > > > > > > cause of the BAS nodes being listed in the POL cluster report.
> > > > > > >
> > > > > > > Regards,
> > > > > > > -Dan.
> > > > > > >
> > > > > > > On Tuesday 30 Oct 2007, you wrote:
> > > > > > > > Dear All,
> > > > > > > >
> > > > > > > > This is the first time I have posted to the list, but I have
> > > > > > > > made good use of the archives on many occasions.  Unfortunately
> > > > > > > > I can't find anything in the archives to help with my current
> > > > > > > > problem.
> > > > > > > >
> > > > > > > > I am monitoring a grid consisting of clusters at three
> > > > > > > > institutions called POL, BAS and ESSC.  The clusters are all
> > > > > > > > from the same supplier and use the same convention for slave
> > > > > > > > node IP addresses and host names. All the clusters are behind
> > > > > > > > their own institutional firewalls.  My Ganglia Web frontend is
> > > > > > > > at the following address:
> > > > > > > > http://www.resc.reading.ac.uk/ganglia/
> > > > > > > >
> > > > > > > > My problem is that the POL cluster report mixes up nodes from
> > > > > > > > all three clusters.  The POL cluster is listed as "NEMO cluster
> > > > > > > > @ POL" on the grid report page of my Web frontend. There are
> > > > > > > > three main problems with the POL cluster report:
> > > > > > > > 1)  Nodes at ESSC and BAS with names not found at POL usually
> > > > > > > > show up as blank spaces on the POL cluster page unless they are
> > > > > > > > down, in which case they are represented by the usual pink box
> > > > > > > > 2) The load level colouring (and hence the positioning on the
> > > > > > > > page) of nodes that have the same name as nodes in other
> > > > > > > > clusters is often governed by the other clusters
> > > > > > > > 3) The overview section of the POL cluster report has incorrect
> > > > > > > > values for load percentages and number of CPUs etc.
> > > > > > > >
> > > > > > > > Here is an excerpt from my gmetad.conf file showing the three
> > > > > > > > data sources. The host names have been changed for security
> > > > > > > > reasons.
> > > > > > > >
> > > > > > > > data_source "POL's gmond" 65 pol.host.name:8649
> > > > > > > > data_source "ESSC's gmond" 60 essc.host.name:8649
> > > > > > > > data_source "BAS's gmond through SSH tunnel" 70 localhost:8647
> > > > > > > >
> > > > > > > > Here is some more information I think may be relevant.
> > > > > > > > -- The ESSC cluster is on the same subnet as my Web frontend
> > > > > > > > server
> > > > > > > > -- There are no problems with the ESSC and BAS cluster reports
> > > > > > > > -- The XML data received from POL's gmond is correct
> > > > > > > > -- My gmetad version is 3.0.3, but I get the same problem on my
> > > > > > > > backup gmetad machine which still has version 2.5.7
> > > > > > > > -- POL's gmond is version 3.0.3, but ESSC and BAS have gmond
> > > > > > > > version 2.5.7
> > > > > > > > -- Accessing POL's gmond through a different port via an SSH
> > > > > > > > tunnel (i.e. localhost:8648 instead of pol.host.name:8649)
> > > > > > > > makes no difference
> > > > > > > > -- Changing the order of the data sources in gmetad.conf makes
> > > > > > > > no difference
> > > > > > > > -- Removing either the ESSC or the BAS data source makes no
> > > > > > > > difference; the POL cluster report still gets mixed up with the
> > > > > > > > other cluster, whichever one it is
> > > > > > > > -- Deleting all the RRD files in /var/lib/ganglia/rrds/ and
> > > > > > > > starting again makes no difference
> > > > > > > > -- The grid report page has correct values for the POL cluster
> > > > > > > >
> > > > > > > > I could change the host names and IP addresses of the ESSC
> > > > > > > > cluster nodes, but that wouldn't stop the POL cluster report
> > > > > > > > getting confused with BAS nodes and changing those clusters is
> > > > > > > > not an option.  Is there any way to solve this problem without
> > > > > > > > making the node names of all the clusters different?  All
> > > > > > > > suggestions would be gratefully received.  I hope I haven't
> > > > > > > > missed something obvious.
> > > > > > > >
> > > > > > > > -Dan Bretherton.
> > > > > > >
> > > > > > > --
> > > > > > > Mr. D.A. Bretherton
> > > > > > > Environmental Systems Science Centre
> > > > > > > Harry Pitt Building
> > > > > > > 3 Earley Gate
> > > > > > > Reading University
> > > > > > > Reading, RG6 6AL
> > > > > > > UK
> > > > > > >
> > > > > > > Tel. +44 118 378 7722
> > > > > > >
> > > > >
> > >
> 
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
