On Fri, May 15, 2009 at 04:32, Carlo Marcelo Arenas Belon
<care...@sajinet.com.pe> wrote:
> On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:
>
>> I have been having a heck of a time diagnosing this problem.
>
> I suspect there are several problems here, which OS and architecture?

Gentoo Linux, webserver is x86, everything else is x86_64

>> I recently updated to ganglia-3.1.2 from 3.0.7.
>
> 3.1 and 3.0 are not compatible and can't be on the same cluster, so for
> this upgrade to be successful you should have done:
>
>  1) upgrade your gmetad/web to 3.1.2
>  2) upgrade all gmond to 3.1.2, cluster by cluster in batches
>
> more details to be found in :
>
>  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

I already updated the entire cluster. My webserver is running the
proper versions of gmetad/web and everything is running the new
version of gmond.

>
>> Since then I have been
>> plagued with (what looked like) data errors; mis-reported swap usage
>> was the easiest to see.
>
> Could you elaborate here? Is the value that gmond is collecting on each
> node incorrect? Is the aggregated value in gmetad incorrect? Which one of
> the swap metrics is incorrect?

Aggregate swap data being incorrect is the easiest to see.
Here is the graph from a mis-reporting host (it doesn't always even
send this information): http://imgur.com/io8gu.png

Here is the resulting aggregate graph: http://imgur.com/trato.png
The beginning of this graph shows the correct data; I simply
restarted gmond (on all non-webserver hosts), and the resulting swap
usage came from one of them failing to send the correct data.

>
> # uname -a
> Linux dell 2.6.28-gentoo-r5 #1 SMP Thu Apr 23 21:35:08 PDT 2009 x86_64 
> Intel(R) Core(TM)2 CPU 6320 @ 1.86GHz GenuineIntel GNU/Linux
> # gmond --version
> gmond 3.1.2
> # telnet 127.0.0.1 8649 | grep swap
> <METRIC NAME="swap_total" VAL="4008176" TYPE="float" UNITS="KB" TN="60" 
> TMAX="1200" DMAX="0" SLOPE="zero">
> <EXTRA_ELEMENT NAME="DESC" VAL="Total amount of swap space displayed in KBs"/>
> Connection closed by foreign host.
> <METRIC NAME="swap_free" VAL="4008176" TYPE="float" UNITS="KB" TN="60" 
> TMAX="180" DMAX="0" SLOPE="both">
> <EXTRA_ELEMENT NAME="DESC" VAL="Amount of available swap memory"/>
> # free | grep Swap
> Swap:      4008176          0    4008176
>
>> This seems to be caused by some reporting
>> modules failing to load. They fail silently, I don't see logs about it
>> anywhere, and when I turn debugging on I still don't see anything.
>
> AFAIK if a module fails to load because of an error it will just prevent
> gmond from starting at all (sometimes silently), as detailed in the "Known Issues".
>
> if the module is not loaded but is still referenced by the configuration
> for collecting, it will also be very noisy about it:
>
> # /etc/init.d/gmond start
>  * Starting GANGLIA gmond:  ...
> Cannot locate internal module structure 'mem_module' in file (null): 
> /usr/sbin/gmond: undefined symbol: mem_module
> Possibly an incorrect module language designation [(null)].
>                                                                          [ ok 
> ]
> # tail /var/log/syslog | grep gmond
> May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
> information for 'mem_total'. Possible that the module has not been loaded.
> May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
> information for 'swap_total'. Possible that the module has not been loaded.
> May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
> information for 'mem_free'. Possible that the module has not been loaded.
> May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
> information for 'mem_shared'. Possible that the module has not been loaded.
> May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
> information for 'mem_buffers'. Possible that the module has not been loaded.
> May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
> information for 'mem_cached'. Possible that the module has not been loaded.
> May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
> information for 'swap_free'. Possible that the module has not been loaded.
>
> What makes you think the module is not being loaded, and that it is
> being silent about it? Does it show up in:
>
>  # lsof -p `pidof gmond` | grep ganglia

I thought the module wasn't being loaded because the host was not
sending any data that would be gathered by that module to my reporting
host. I can now see that it is being loaded, just not sending all of
the data.

gmond   32678 nobody  mem    REG      8,3   22928   330627
/usr/lib64/ganglia/modpython.so
gmond   32678 nobody  mem    REG      8,3   97312   330621
/usr/lib64/ganglia/modsys.so
gmond   32678 nobody  mem    REG      8,3   96992   330624
/usr/lib64/ganglia/modproc.so
gmond   32678 nobody  mem    REG      8,3   97184   330630
/usr/lib64/ganglia/modnet.so
gmond   32678 nobody  mem    REG      8,3   97408   330613
/usr/lib64/ganglia/modmem.so
gmond   32678 nobody  mem    REG      8,3   97088   330636
/usr/lib64/ganglia/modload.so
gmond   32678 nobody  mem    REG      8,3   97088   330615
/usr/lib64/ganglia/moddisk.so
gmond   32678 nobody  mem    REG      8,3   97632   330607
/usr/lib64/ganglia/modcpu.so
gmond   32678 nobody  mem    REG      8,3   91552   330616
/usr/lib64/libganglia-3.1.2.so.0.0.0

>
>> Usually it is one of the modules, but I have had two occasionally
>> happen at the same time. modmem.so and modnet.so are the two that most
>> commonly fail.
>
> what is observed when they "fail to load"?

They simply don't report data that should be reported, like memory
usage. If there are no rrds of previous data, the graph doesn't get
drawn; it is simply replaced by some placeholder text (possibly drawn
by rrd).

>> I have restarted with a new gmond configuration, changing only the
>> configuration of multicast to unicast, and this problem persists.
>
> this might have introduced another problem: for unicast to work somewhat
> reliably you need to set a value for send_metadata_interval.

I have pushed out a change to the config with send_metadata_interval set to 60.
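
For anyone hitting the same thing, the relevant gmond.conf pieces for unicast
ended up looking roughly like this (the collector host and port below are
placeholders, not my real values):

```
/* Sketch of a unicast gmond.conf; collector.example.com is a placeholder. */
globals {
  send_metadata_interval = 60   /* resend metric metadata every 60 seconds */
}

/* send to a single collector host instead of joining a multicast group */
udp_send_channel {
  host = collector.example.com
  port = 8649
}

/* the collector listens on the same port */
udp_recv_channel {
  port = 8649
}
```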

>> I have wiped my old rrd data. I have tried everything I know that could
>> even remotely be to blame for this problem.
>>
>> The question I have is this: is this a known bug?
>
> some are, like the unicast send_metadata_interval or the cpu_count
> inconsistency described in the "Important Notes"; some others might not be

I haven't been able to find the "Important Notes" document, is there a
link to this somewhere?

Is the cpu_count inconsistency the piece I mentioned about hosts
disappearing from the web interface?

>> Is there something else I should try?
>
> roll back to 3.0, especially if you don't need the modules but want a more
> stable setup.

This being Gentoo, I have no "easy" way of rolling back, as the 3.0.x
builds have been removed from their tree.
The whole reason I upgraded was because I wanted to make use of the
python module support. I was previously using gmetric for monitoring
things like PBS job count and temperature on my nodes. After a week or
two of those scripts running, the load average on the systems started
to climb. After a month, the load average increase caused by gmetrics
was around 2-4 per host. A full 10% of my cluster's CPU utilization was
caused by gmetrics alone (all "system" CPU).
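
As a sketch of what I'm replacing those gmetric cron jobs with: a minimal
Python metric module for gmond 3.1's modpython. The pbs_job_count() body is a
placeholder — it would need to be wired up to qstat or your PBS API; the
metric name and group are examples, not anything gmond ships with.

```python
# Minimal sketch of a Ganglia 3.1 Python metric module, as a lighter-weight
# replacement for a cron'd gmetric script. Drop it in the modpython
# module directory and reference it from a .pyconf file.

def pbs_job_count(name):
    """Callback gmond invokes on each collection interval.

    Receives the metric name and must return the current value.
    Placeholder implementation: count jobs via your PBS interface here.
    """
    return 0

def metric_init(params):
    """Called once by modpython at load time; returns metric descriptors."""
    return [{
        'name': 'pbs_job_count',            # example metric name
        'call_back': pbs_job_count,
        'time_max': 90,
        'value_type': 'uint',
        'units': 'jobs',
        'slope': 'both',
        'format': '%u',
        'description': 'Number of PBS jobs on this node',
        'groups': 'pbs',
    }]

def metric_cleanup():
    """Called once when gmond shuts down."""
    pass
```

The upside over gmetric is that the collection runs inside the long-lived
gmond process instead of forking a new process (and a shell) every interval.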

>> Can I force a module to be loaded?
>
> no, but a module should never fail to load "silently" AFAIK
>
>> When the modules do load, hosts report to gmond, and gmetad grabs that
>> data and logs it. My webserver then serves up the data through the
>> ganglia interface. The problem I am having here is that I get
>> intermittent xml errors, mostly saying that there is a missing > on
>> line $SomeLineNumber (always changes). Happens every 15 minutes or so.
>> I cannot reproduce any problems with the xml, however. I ran xmllint
>> on the xml once per second for an hour with no errors, during which time
>> the web interface failed to load twice.
>
> you might be triggering the problem by running those tests if the gmetad
> summarization is overloaded and you don't have enough "server_threads".

Increasing the server threads seems to have fixed this problem.
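
For reference, this is the single gmetad.conf knob involved (the value below
is an example, not my exact setting):

```
# gmetad.conf -- size of the thread pool answering XML/interactive requests.
# The default is small; 8 is an example value, tune to your cluster.
server_threads 8
```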

> I presume you found the failure in your web logs; which page failed to load?
> What is the load on the gmetad/web server? How long does downloading
> and parsing the XML tree take (it is in the footer)? What is the size
> of the XML? Do you have any remote gmetads that are being called for
> summarization? Might they be slow to answer due to network latency? Does
> increasing the buffer size (hardcoded to 16384) in ganglia.php:344 help?

The load on the webserver fluctuates between 0 and 1.5, on a 2-processor box.
The size of the XML is 75K.
No remote gmetad.
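
In case it helps anyone else chasing intermittent XML errors: a small sketch
of the check I can run against gmetad instead of the xmllint loop. It assumes
gmetad's default xml_port of 8651; the parsing half works on any XML bytes.

```python
import socket
import xml.etree.ElementTree as ET

def read_gmetad_xml(host="localhost", port=8651, timeout=5.0):
    """Read the full XML dump that gmetad serves on its xml_port.

    gmetad writes the whole document and closes the connection,
    so we just read until EOF.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as s:
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def check_well_formed(xml_bytes):
    """Return (True, None) if the XML parses, else (False, error message)."""
    try:
        ET.fromstring(xml_bytes)
        return True, None
    except ET.ParseError as e:
        return False, str(e)
```

Run in a loop, logging the parse error and a copy of the bad dump whenever
check_well_formed() fails, this catches the truncated documents the web
interface trips over without hammering gmetad as hard as a shell loop does.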


> Carlo
>

_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
