I'm glad I remembered reading something about this here recently.  I added 
myself to Bug 324, but I figured I'd go ahead and echo the issue here as well.

I'm on CentOS 5 with the exact same issue on 3.3.1, but only for the aggregating 
gmetads: no summaries are produced unless "scalable off" is set, in which case 
the grids are collapsed into a single grid, which is not what we want.
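
For reference, the aggregating gmetad's config boils down to something like the 
following (the grid name, host names, and ports here are placeholders, not my 
real values):

# gmetad.conf on the aggregating node (names/ports are examples only)
gridname "Master"

# Each data_source points at a child gmetad's xml_port rather than at a
# gmond, so this gmetad should be summarizing the nested <GRID> data it
# pulls back from them.
data_source "Compute" 15 child1.example.org:8651
data_source "Storage" 15 child2.example.org:8651

# Default. Turning this off is the only thing that makes summaries appear
# under 3.3.1, but then everything is rolled up into one flat grid.
scalable on

xml_port 8651
interactive_port 8652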

Reverting them to 3.1.7 gets me back to working as designed.

My debug output pattern is identical to that reported by Alexander (with 3.3.1 
vs. 3.1.7).
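
In case it helps anyone else chasing this, the quick check I've been using is 
essentially Alexander's telnet test pointed at the aggregating gmetad: dump its 
XML and count the nested <GRID> elements (the host and port below are 
placeholders; use whatever xml_port you have configured):

# Count the <GRID> elements the aggregating gmetad re-exports for its
# child gmetads.
telnet aggregator.example.org 8651 2>/dev/null | grep -c '<GRID'

With 3.1.7 the child grids show up there as expected; with 3.3.1 they don't, 
which lines up with the missing "Found a <GRID>" / "Writing Summary data" lines 
in the debug output Matthew posted below.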

So it seems that 3.3.1 (and anything 3.2.0 or later) is unusable for grid 
summaries on at least several Unix variants.

I suppose I could cross-post to the developer list, but is there anyone here 
familiar with this code who might be able to hand out some clue?  I'm happy to 
help in any way I can.

Thanks!

Oz

On Feb 22, 2012, at 9:07 AM, Matthew Nicholson wrote:
> Yeah, telnetting to the "remote" gmetads works just fine. It's as soon as
> I upgrade the gmetad talking to those that I stop getting info.
> 
> Bug report time...
> 
> On Wed, Feb 22, 2012 at 5:39 AM, Alexander Karner <a...@de.ibm.com> wrote:
>> Hi!
>> 
>> Please check if your remote gmetads export data by running
>> telnet <host> <xml port>
>> 
>> --> I had similar behaviour when I upgraded my central gmetad to 3.2.0. No
>> data was collected from the other grids, but running the telnet command
>> returned a long list of XML data.
>> Switching back to 3.1.7 solved the problem.
>> 
>> If your remote systems are able to export data on the XML port, your central
>> gmetad seems to have the same problem that I had.
>> 
>> 
>> 
>> 
>> From:        Matthew Nicholson <matthew.a.nichol...@gmail.com>
>> To:        ganglia-general@lists.sourceforge.net,
>> Date:        19.02.2012 19:07
>> Subject:        [Ganglia-general] nested grids questions/issues
>> ________________________________
>> 
>> 
>> 
>> So, I recently inherited a large Ganglia installation we use to
>> monitor our HPC cluster and associated services. Due to our model, we
>> need to break our cluster and storage up into smaller "clusters" that
>> serve specific purposes, aggregate those into grids, and then aggregate
>> those grids into another "master"-level grid, though in one case there
>> is a 3rd level of grid aggregation.
>> 
>> This is all unicast based, and we (I'm sure I'll be told to do
>> otherwise, but that's not an option currently) run ~55 gmonds and ~
>> gmetads on our "ganglia" box. Everything communicates to this box on a
>> range of unicast ports.
>> 
>> More info on our nesting:
>> Master (gmetad) -> 3 other gmetads -> lots and lots of gmonds
>>                 -> 1 gmetad for storage -> 2 gmetads (Lustre + NFS)
>>                                         -> lots of gmonds
>> 
>> That's basically it.
>> 
>> Okay, so this setup works today, running Ganglia 3.1.4, but the gmetads
>> fall over from time to time. We would like to get everything up to
>> 3.3.0/1 and update our web frontend as well.
>> 
>> I've been updating the gmonds on the service side without issues, and
>> the immediate parent gmetads (that is, gmetads that only collect from
>> gmonds) also without issue.
>> 
>> However, as soon as I restart a gmetad that polls other gmetads (the
>> gmetad_storage one, for example), I get no summary information at all.
>> The only change is that I'm starting a different binary from the init
>> script. It starts and runs without error, and with debugging enabled
>> I get:
>> 
>> Going to run as user nobody
>> Sources are ...
>> Source: [NFS, step 15] has 1 sources
>>                 127.0.0.1
>> Source: [Lustre, step 15] has 1 sources
>>                 127.0.0.1
>> xml listening on port 8657
>> interactive xml listening on port 8658
>> cleanup thread has been started
>> Data thread 1168345408 is monitoring [NFS] data source
>>                 127.0.0.1
>> Data thread 1178835264 is monitoring [Lustre] data source
>>                 127.0.0.1
>> 
>> Whereas, with the older 3.1.4 binary:
>> Going to run as user nobody
>> Sources are ...
>> Source: [NFS, step 15] has 1 sources
>>                 127.0.0.1
>> Source: [Lustre, step 15] has 1 sources
>>                 127.0.0.1
>> xml listening on port 8657
>> interactive xml listening on port 8658
>> Data thread 1170368832 is monitoring [NFS] data source
>>                 127.0.0.1
>> Data thread 1180858688 is monitoring [Lustre] data source
>>                 127.0.0.1
>> cleanup thread has been started
>> [NFS] is a 2.5 or later data stream
>> hash_create size = 50
>> hash->size is 53
>> Found a <GRID>, depth is now 1
>> Found a </GRID>, depth is now 0
>> Writing Summary data for source NFS, metric
>> storage_local__nfs_cleanenergy1_size
>> Writing Summary data for source NFS, metric disk_free
>> Writing Summary data for source NFS, metric
>> storage_local__nfs_nobackup2_percent_used
>> Writing Summary data for source NFS, metric storage_local__itc1_percent_used
>> Writing Summary data for source NFS, metric storage_local__mnt_emcback7_size
>> Writing Summary data for source NFS, metric
>> storage_local__nfs_atlascode_size
>> Writing Summary data for source NFS, metric bytes_out
>> etc etc etc
>> 
>> 
>> I've been unable to find much on issues like this: no documented changes
>> to the way gmetad reads downstream gmetads, and no obvious new config
>> options in 3.3.0.
>> 
>> Am I missing something?
>> I'll happily provide gmetad configs if needed.
>> 
>> --
>> Matthew Nicholson
>> 
> 
> 
> 
> -- 
> Matthew Nicholson
> 