On Fri, May 15, 2009 at 10:28, Carlo Marcelo Arenas Belon
<care...@sajinet.com.pe> wrote:
> On Fri, May 15, 2009 at 08:42:33AM -0500, Adam Tygart wrote:
>> On Fri, May 15, 2009 at 04:32, Carlo Marcelo Arenas Belon
>> <care...@sajinet.com.pe> wrote:
>> > On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:
>> >
>> >> Since then I have been
>> >> plagued with (what looked like) data errors; the mis-reporting of swap
>> >> usage was the easiest to see.
>> >
> could you elaborate here? Is the value that gmond is collecting on each
> node incorrect? Is the aggregated value in gmetad incorrect? Which one of
> the swap metrics is incorrect?
>>
>> Aggregate swap data being incorrect is the easiest to see.
>> Here is the graph from a mis-reporting host (it doesn't always even
>> send this information): http://imgur.com/io8gu.png
>>
>> Here is the resulting aggregate graph: http://imgur.com/trato.png
>> The beginning of this graph shows the correct data; I simply
>> restarted gmond (on all non-webserver hosts), and the resulting swap
>> usage came from one of them failing to send the correct data.
>
> OK, the metric value is not incorrect, but it is not being reported at all,
> which is why you have dips in your graph that fix themselves after several
> minutes.
>
> This is sadly a known issue, caused by the way gmond registers metrics
> dynamically and the fact that some of those metrics aren't refreshed very
> frequently, as described in the Release Notes (the very visible CPU count
> issue is mentioned there as an example). For more details see the
> discussion at:
>
>  http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04275.html
>
> And even though I agree it is a bug, it doesn't yet have a solution, and it
> is not seen unless a gmond is restarted (any of them).
>
> a workaround is available, though: if you have to restart a gmond, restart
> its "collector" first (the one that gmetad is looking at, and that the rest
> point to when using unicast), and then restart ALL the other gmonds in the
> cluster after that.
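
That restart order is good to know; for next time, here is roughly how I
would script it. This is only a sketch under my own assumptions (passwordless
ssh to every node, gmond managed by /etc/init.d/gmond, and made-up host
names), not something taken from this thread:

# restart_order.py -- sketch only; assumes passwordless ssh, gmond managed
# by /etc/init.d/gmond, and hypothetical host names.
import subprocess

COLLECTOR = 'head1'                       # the gmond that gmetad polls
OTHERS = ['node01', 'node02', 'node03']   # every other gmond in the cluster

def restart_gmond(host):
    """Restart gmond on a single host via ssh; complain if it fails."""
    rc = subprocess.call(['ssh', host, '/etc/init.d/gmond restart'])
    if rc != 0:
        raise RuntimeError('gmond restart failed on %s' % host)

# restart the collector first, so it is up and listening before the other
# nodes send it anything over unicast...
restart_gmond(COLLECTOR)

# ...then restart every other gmond so they all re-register their metrics
# with the freshly started collector.
for host in OTHERS:
    restart_gmond(host)

If I am reading the workaround right, the second loop is the important part:
without it, some metrics stay unregistered on the collector until their next
(infrequent) refresh.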

I should have specified how I got the graph. I had everything
"working": modules were loaded and everything was being reported. In an
attempt to reproduce the issue I was having, I restarted the first
collector. This caused the gmond for this host to stop reporting. I
let it sit; this is when the large break in the graph occurred. I then
restarted gmond on this host, and it reported only cached memory. I
restarted it again, and it started reporting all memory statistics
correctly.

During this period, the aggregate graph showed a drop in hosts, and
then a recovery. The recovery came when gmond on the reporting hosts
was restarted the first time. The graph then shows an unusual amount
of swap usage; this is not the "real" data. Once I restarted gmond on
the mis-reporting host again, the swap usage dropped.

>> >> The question I have is this: is this a known bug?
>> >
>> > some are, like the unicast send_metadata_interval or the cpu_count
>> > inconsistency shown in the "Important Notes"; some others might not be.
>>
>> I haven't been able to find the "Important Notes" document; is there a
>> link to it somewhere?
>
> sadly it is buried at the bottom of the "Release Notes" now:
>
>  http://ganglia.wiki.sourceforge.net/ganglia_release_notes
>
> and yes, I agree it should be moved to a better place as well.
>
>> Is the cpu_count inconsistency the piece I mentioned about hosts
>> disappearing from the web interface?
>
> most likely the hosts are disappearing from the web interface because of
> the send_metadata_interval and your restarting gmond to fix it.
>
> if it is not then we have a new bug ;)

The hosts come back with every third or fourth manual refresh of the web page.
Not all of them disappear, just some of them. Some hosts are more apt
to disappear than others: rogue2, janus, rogue10, and rogue8, to name a few.
(I have re-added hosts to make this effect more obvious.)
If you would like to look at the page for yourself, it is located
here: http://beocat.cis.ksu.edu/
For reference, 27 hosts should show up now.
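
Based on that pointer, the thing I plan to try before anything drastic is a
non-zero send_metadata_interval in the globals section of gmond.conf on the
unicast senders. Something like the snippet below; the 60 seconds is just a
value I picked for illustration, not a recommendation from this thread:

/* gmond.conf on each unicast sender; only the relevant setting shown.
   60 is an arbitrary example value. */
globals {
  send_metadata_interval = 60   /* resend metric metadata every 60 seconds */
}

If that works, it should also make the restart-order dance above less
critical, since the senders would re-announce their metadata on their own.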

>> >> Is there something else I should try?
>> >
>> > roll back to 3.0, especially if you don't need the modules but want a
>> > more stable setup.
>>
>> This being Gentoo, I have no "easy" way of rolling back, as the 3.0.x
>> builds have been removed from their tree.
>
> OK, IMHO having ganglia 3.0 in their tree as well, under a different slot,
> might be a good idea, but sadly I haven't yet filed a bug for it, nor can I
> provide a working ebuild in a public overlay as a solution either; of
> course you can still build your own binaries/packages if needed.
>
> 3.0 is still under development, with 3.0.8 to be released sometime soon and
> future releases focusing mainly on stability and compatibility with 3.1, as
> well as on supporting all the architectures that are not yet working in
> 3.1.

I have been tempted to roll back, even if I have to roll my own build,
but I figured I would put in a real effort to make the new version
work for me.

>> The whole reason I upgraded was because I wanted to make use of the
>> python module support. I was previously using gmetric for monitoring
>> things like PBS job count and temperature on my nodes. After a week or
>> two of those scripts running, the load average on the systems started
>> to climb. After a month, the load average increase caused by gmetrics
>> was 2-4 per host. A full 10% of my cluster's CPU utilization was
>> caused by gmetrics alone (all "system" cpu).
>
> most likely that was the spawn/fork cost and the fact that they were run
> too frequently; 3.1 modules might be a good solution for that. But if the
> metric collection itself is expensive (and I would assume it is, as I have
> never seen that much consumption from my own gmetrics, which are executed
> every second), then you are not going to solve the problem by just moving
> that expensive operation into gmond.

The qstat application from the Torque/Maui package is very
resource-intensive, but I don't see why it would take more resources
after a month than it did at the beginning.

We are moving to SGE for our queuing system now, and Python will be
able to do a lot of what I want to monitor natively. And if I do
have to call qstat from Python, this version is much easier on
resources.
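
In case it is useful to anyone else heading the same way, below is the rough
shape of module I have in mind. It is only a sketch of my understanding of
the 3.1 mod_python interface (metric_init/metric_cleanup plus one callback
per metric); the module name, the crude qstat "parsing", and the 60-second
cache interval are placeholders I made up, not something we are running yet.

# pbs_jobs.py -- sketch of a gmond python metric module, not production code.
# Assumptions: the stock 3.1 mod_python interface (metric_init, metric_cleanup,
# one callback per metric); the qstat handling below is a crude placeholder.
import subprocess
import time

_cache = {'value': 0, 'when': 0.0}
CACHE_SECS = 60  # don't fork qstat more often than this

def _job_count(name):
    """Callback gmond invokes for the metric; returns a cached job count."""
    now = time.time()
    if now - _cache['when'] > CACHE_SECS:
        try:
            out = subprocess.Popen(['qstat'],
                                   stdout=subprocess.PIPE).communicate()[0]
            # placeholder: count the non-header lines of qstat output
            _cache['value'] = max(len(out.splitlines()) - 2, 0)
        except OSError:
            pass  # keep the previous value if qstat isn't available
        _cache['when'] = now
    return _cache['value']

def metric_init(params):
    """Called once when gmond loads the module; returns metric descriptors."""
    return [{'name': 'pbs_job_count',
             'call_back': _job_count,
             'time_max': 90,
             'value_type': 'uint',
             'units': 'jobs',
             'slope': 'both',
             'format': '%u',
             'description': 'Number of jobs reported by qstat'}]

def metric_cleanup():
    """Called once when gmond shuts down; nothing to clean up here."""
    pass

if __name__ == '__main__':
    # quick standalone check without gmond
    for d in metric_init({}):
        print('%s = %s' % (d['name'], d['call_back'](d['name'])))

The main point is the cache: gmond would fork qstat at most once a minute
instead of on every collection cycle, which is where the gmetric scripts
were hurting us.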

--
Adam
