On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:

> I have been having a hack of a time diagnosing this problem.

I suspect there are several problems here. Which OS and architecture are
you running?

> I recently updated to ganglia-3.1.2 for 3.0.7.

3.1 and 3.0 are not compatible and can't coexist in the same cluster, so for
this upgrade to be successful you should have done the following:

  1) upgrade your gmetad/web to 3.1.2
  2) upgrade all gmond to 3.1.2, cluster by cluster in batches

more details can be found in:

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes
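
to confirm that no 3.0.x gmond is left behind in a cluster after step 2,
something like the following might help. it is only a sketch: hosts.txt (one
hostname per line), passwordless ssh to every node and the two helper names
are assumptions made for illustration, not something from your setup.

```shell
#!/bin/sh
# Print the hosts whose reported gmond version differs from the expected one.
# Input on stdin: one "host version" pair per line.
mismatched_hosts() {
    # Appending "" forces a string comparison, so "3.1" and "3.10" differ.
    awk -v want="$1" '($2 "") != (want "") { print $1 }'
}

# Hypothetical collection step: reads hostnames from the file given as $1
# and asks each node for its version over ssh (both are assumptions).
collect_versions() {
    while read -r host; do
        # "gmond --version" prints e.g. "gmond 3.1.2"; keep the second field.
        # ssh -n keeps ssh from swallowing the rest of the host list on stdin.
        ver=$(ssh -n "$host" gmond --version 2>/dev/null | awk '{ print $2 }')
        echo "$host ${ver:-unknown}"
    done < "$1"
}

# Usage: collect_versions hosts.txt | mismatched_hosts 3.1.2
```

any host printed by the last command is still running something other than
3.1.2 (or could not be reached) and should be looked at before going on.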

> Since then I have been
> plagued with (what looked like) data errors, mis-reporting swap usage
> was the easiest to see.

could you elaborate here? is the value that gmond collects on each
node incorrect? is the aggregated value in gmetad incorrect? which one of
the swap metrics is incorrect?

# uname -a
Linux dell 2.6.28-gentoo-r5 #1 SMP Thu Apr 23 21:35:08 PDT 2009 x86_64 Intel(R) 
Core(TM)2 CPU 6320 @ 1.86GHz GenuineIntel GNU/Linux
# gmond --version
gmond 3.1.2
# telnet 127.0.0.1 8649 | grep swap
<METRIC NAME="swap_total" VAL="4008176" TYPE="float" UNITS="KB" TN="60" 
TMAX="1200" DMAX="0" SLOPE="zero">
<EXTRA_ELEMENT NAME="DESC" VAL="Total amount of swap space displayed in KBs"/>
Connection closed by foreign host.
<METRIC NAME="swap_free" VAL="4008176" TYPE="float" UNITS="KB" TN="60" 
TMAX="180" DMAX="0" SLOPE="both">
<EXTRA_ELEMENT NAME="DESC" VAL="Amount of available swap memory"/>
# free | grep Swap
Swap:      4008176          0    4008176

> This seems to be caused by some reporting
> modules failing to load. They fail silently, I don't see logs about it
> anywhere, and when I turn debugging on I still don't see anything.

AFAIK if a module fails to load because of an error it will just prevent
gmond from starting at all (sometimes silently), as detailed in the
"Known Issues".

if the module is not loaded but the configuration still refers to it for
collection, gmond will also be very noisy about it:

# /etc/init.d/gmond start
 * Starting GANGLIA gmond:  ...
Cannot locate internal module structure 'mem_module' in file (null): 
/usr/sbin/gmond: undefined symbol: mem_module
Possibly an incorrect module language designation [(null)].
                                                                          [ ok ]
# tail /var/log/syslog | grep gmond
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
information for 'mem_total'. Possible that the module has not been loaded.  
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
information for 'swap_total'. Possible that the module has not been loaded.  
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
information for 'mem_free'. Possible that the module has not been loaded.  
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
information for 'mem_shared'. Possible that the module has not been loaded.  
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
information for 'mem_buffers'. Possible that the module has not been loaded.  
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
information for 'mem_cached'. Possible that the module has not been loaded.  
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric 
information for 'swap_free'. Possible that the module has not been loaded.

what makes you think the module is not being loaded, and that gmond is being
silent about it? does it show up in the output of:

  # lsof -p `pidof gmond` | grep ganglia

> Usually it is one of the modules, but I have had two occasionally
> happen at the same time. modmem.so and modnet.so are the two to most
> commonly fail.

what is observed when they "fail to load"?

> I have restarted with a new gmond configuration, changing only the
> configuration of multicast to unicast, and this problem persists.

this might have introduced another problem: for unicast to work reasonably
reliably you need to set a value for send_metadata_interval.
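
a quick way to check whether a given gmond.conf already does that is sketched
below. the helper name is made up, and the 30 seconds in the comment is just
an illustrative value, not a recommendation from the release notes.

```shell
#!/bin/sh
# In a 3.1.x gmond.conf the setting lives in the globals section, e.g.:
#
#   globals {
#     send_metadata_interval = 30   /* seconds; 0 never resends metadata */
#   }
#
# Succeeds if the given gmond.conf sets a nonzero send_metadata_interval.
has_metadata_interval() {
    awk '
        /^[ \t]*#/ { next }                      # skip "#" comment lines
        /send_metadata_interval[ \t]*=/ {
            # Drop everything up to the "=", then read the leading number.
            sub(/.*send_metadata_interval[ \t]*=[ \t]*/, "")
            if ($0 + 0 > 0) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Usage (the path is an assumption, adjust to your install):
#   has_metadata_interval /etc/ganglia/gmond.conf || echo "interval not set"
```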

> I have wiped my old rrd data. I have tried everything I know that could
> even remotely be to blame for this problem.
> 
> The question I have is this: is this a known bug?

some are, like the unicast send_metadata_interval issue or the cpu_count
inconsistency described in the "Important Notes"; others might not be.

> Is there something else I should try?

roll back to 3.0, especially if you don't need the modules and want a more
stable setup.

> Can I force a module to be loaded?

no, but a module should never fail to load "silently" AFAIK

> When the modules do load, hosts report to gmond, and gmeta grabs that
> data and logs it. My webserver then serves up the data through the
> ganglia interface. The problem I am having here is that I get
> intermittent xml errors, mostly saying that there is a missing > on
> line $SomeLineNumber (always changes). Happens every 15 minutes or so.
> I cannot reproduce any problems with the xml, however. I ran xmllint
> on the xml 1 per second for an hour with no errors, during which time
> the web interface failed to load twice.

you might be triggering the problem by running those tests, if the gmetad
summarization is overloaded and you don't have enough "server_threads".

I presume you found the failure in your web logs; which page failed to load?
what is the load on the gmetad/web server? how long does downloading
and parsing the XML tree take (it is shown in the page footer)? what is the
size of the XML? do you have any remote gmetad that are being called for
summarization? might they be slow to answer due to network latency? does
increasing the buffer size (hardcoded to 16384) in ganglia.php:344 help?
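
to get the size of the XML and to see whether a dump arrives complete, a rough
sketch follows. it assumes nc (netcat) is installed and that gmetad listens on
8651, which is its default xml_port; the helper names and /tmp/gmetad.xml are
made up. a missing closing tag is only a hint that the dump was cut short, not
proof of what the web frontend saw.

```shell
#!/bin/sh
# Fetch one XML dump from a gmetad or gmond port. The daemon writes the
# whole tree and then closes the connection, which ends the stream.
fetch_xml() {
    nc "$1" "$2"
}

# Report the size of a saved dump and whether the root element was closed.
check_dump() {
    size=$(wc -c < "$1" | tr -d ' ')
    if tail -n 5 "$1" | grep -q '</GANGLIA_XML>'; then
        echo "$size bytes, complete"
    else
        echo "$size bytes, truncated"
        return 1
    fi
}

# Usage:
#   fetch_xml localhost 8651 > /tmp/gmetad.xml
#   check_dump /tmp/gmetad.xml && xmllint --noout /tmp/gmetad.xml
# To time the download itself: time nc localhost 8651 > /tmp/gmetad.xml
```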

Carlo
