Re: [Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found

Carlo Marcelo Arenas Belon Fri, 15 May 2009 08:32:16 -0700

On Fri, May 15, 2009 at 08:42:33AM -0500, Adam Tygart wrote:
> On Fri, May 15, 2009 at 04:32, Carlo Marcelo Arenas Belon
> <[email protected]> wrote:
> > On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:
> >
> >> Since then I have been
> >> plagued with (what looked like) data errors, mis-reporting swap usage
> >> was the easiest to see.
> >
> > could you elaborate here?, is the value that gmond is collecting on each
> > node incorrect?, is the aggregated in gmetad incorrect?, which one of the
> > swap metrics is incorrect?
> 
> Aggregate swap data being incorrect is the easiest to see.
> Here is the graph from a mis-reporting host (it doesn't always even
> send this information): http://imgur.com/io8gu.png
> 
> Here is the resulting aggregate graph: http://imgur.com/trato.png
> The beginning of this graph is showing the correct data, I simply
> restarted gmond (on all non-webserver hosts), and the resulting swap
> usage was from one of them failing to send the correct data.


OK, so the metric value is not incorrect; it is not being reported at all,
which is why you get dips in your graph that fix themselves after several
minutes.

This is sadly a known issue. It comes from the way gmond registers metrics
dynamically, combined with the fact that some of those metrics aren't
refreshed very frequently, as described in the Release Notes (which mention
the very visible CPU count issue as an example). For more details, see the
discussion at:

  http://www.mail-archive.com/[email protected]/msg04275.html

And even though I agree it is a bug, it doesn't have a solution yet, and it
is not seen unless a gmond (any of them) is restarted.

A workaround is available: if you have to restart a gmond, first restart its
"collector" (the gmond that gmetad is looking at, and that the rest point to
when using unicast), and then restart ALL the other gmonds in the cluster.
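
As a rough sketch of that ordering (the host names are made up, and how you
actually restart gmond on each host depends on your init system, so this only
computes the order):

```python
def restart_plan(collectors, all_hosts):
    """Return hosts in restart order: collectors first, then every other gmond.

    Both arguments are plain lists of host names (hypothetical examples below).
    """
    seen = set()
    plan = []
    # Collectors go first so they are up before the other gmonds
    # announce their metrics again.
    for host in list(collectors) + list(all_hosts):
        if host not in seen:
            seen.add(host)
            plan.append(host)
    return plan


if __name__ == "__main__":
    hosts = ["node01", "node02", "head", "node03"]  # illustrative names
    for step, host in enumerate(restart_plan(["head"], hosts), 1):
        print(step, host)
```

The point is only that no non-collector gmond is restarted before its
collector, and that none of the others is skipped.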

> >> The question I have is this: is this a known bug?
> >
> > some are, like the unicast send_metadata_interval or the cpu_count
> > inconsistency as shown by the "Important Notes", some others might not be
> 
> I haven't been able to find the "Important Notes" document, is there a
> link to this somewhere?

sadly it is now buried at the bottom of the "Release Notes":

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

and yes, I agree it should be moved to a better place as well.

> Is the cpu_count inconsistency the piece I mentioned about hosts
> disappearing from the web interface?

most likely the hosts disappear from the web interface because of the
send_metadata_interval issue, triggered when you restart gmond to try to
fix things.

if it is not then we have a new bug ;)
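
A quick way to check a node for this, as a sketch only: the helper names are
mine, and the parsing is a naive regex rather than a real gmond.conf parser.
It assumes that an unset or zero send_metadata_interval means metadata is
only sent at startup or on request, which is the case that bites under
unicast.

```python
import re

# Naive match for a line such as:  send_metadata_interval = 30 /*secs */
_INTERVAL_RE = re.compile(r"^\s*send_metadata_interval\s*=\s*(\d+)", re.MULTILINE)


def metadata_interval(conf_text):
    """Return the configured send_metadata_interval, or None if absent."""
    match = _INTERVAL_RE.search(conf_text)
    return int(match.group(1)) if match else None


def needs_attention(conf_text):
    """True when metadata is never resent periodically (unset or 0)."""
    interval = metadata_interval(conf_text)
    return interval is None or interval == 0


if __name__ == "__main__":
    sample = "globals {\n  send_metadata_interval = 0 /*secs */\n}\n"
    print(metadata_interval(sample), needs_attention(sample))
```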

> >> Is there something else I should try?
> >
> > rollback to 3.0, specially if you don't need the modules but want a more
> > stable setup.
> 
> This being Gentoo, I have no "easy" way of rolling back, as the 3.0.x
> builds have been removed from their tree.

OK. IMHO having ganglia 3.0 in their tree as well, in a different slot,
might be a good idea, but sadly I haven't filed that as a bug yet, nor can I
offer a working ebuild in a public overlay as a solution. Of course you can
still build your own binaries/packages if needed.

3.0 is still under development, with 3.0.8 due to be released sometime soon
and future releases focusing mainly on stability and compatibility with 3.1,
as well as on supporting the architectures that are not yet working in 3.1.

> The whole reason I upgraded was because I wanted to make use of the
> python module support. I was previously using gmetric for monitoring
> things like PBS job count and temperature on my nodes. After a week or
> two of those scripts running, the load average on the systems started
> to climb. After a month, the load average increase caused by gmetrics
> was 2-4 per host. A full 10% of my cluster's CPU utilization was
> caused by gmetrics alone (all "system" cpu).

most likely that is the spawn/fork cost, combined with the scripts being run
too frequently. 3.1 modules might be a good solution for that, but if the
metric collection itself is expensive (and I would assume it is, as I have
never seen that much consumption from my own gmetric scripts, which are
executed every second), then you are not going to solve the problem just by
moving that expensive operation into gmond.

Carlo
