Hi,

I've been thinking about the Python module interface and how best to use it.
Gmond uses a single thread that executes the callback function for
every metric of every module on a schedule.
This seems like a brittle design that won't scale to many metrics.
If a developer writes a module whose callback takes too long, it prevents
the other metric callbacks from being called.
I was thinking the design should use threads to prevent a bad module
from DoSing the rest of the modules.
Either that or an enforced timer...
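
To make the "enforced timer" idea concrete, here is a rough Python-level
sketch. It is not how gmond schedules callbacks (gmond itself is written in
C); call_with_deadline, its parameters and the pool size are all
hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Assumption: a small shared pool is enough for the sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(callback, name, timeout, fallback=None):
    """Run callback(name) in a pool thread and wait at most `timeout` seconds.

    Returns the callback's value, or `fallback` if the deadline passes.
    """
    future = _pool.submit(callback, name)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        # The slow callback is NOT killed; it keeps running in its pool
        # thread. Only the caller is released to move on.
        return fallback
```

Note the limitation in the comment: a timer like this only stops the
scheduler from waiting. A module that hangs forever would still tie up a
pool thread.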

I maintain a largish cluster of MySQL databases, so I wanted to use Ganglia
to get MySQL stats. I found a Python module for doing this:

http://g.raphaelli.com/2009/1/5/ganglia-mysql-metrics

This module provides a lot of useful MySQL metrics from the output of
about 5 MySQL queries.
I briefly audited the code and found some interesting things.
The callback function calls update_stats() before returning the
relevant metric.
Obviously we want update_stats() to cache the data and to run the
MySQL queries again only after all the metric callbacks have read
their metric from the cache.
However, I noticed this at the beginning of update_stats():

        if time.time() - last_update < 15:
                return True
        else:
                last_update = time.time()

This design assumes that the gmond metric scheduler will schedule all
MySQL metric callbacks within 15 seconds of the first MySQL metric
callback.
This is probably a safe assumption but it still bothers me ;-)

I think gmond is unlikely to take longer than 15 seconds to call all
the MySQL metric callbacks.
But a MySQL database could easily take longer than 15 seconds to
return results for the 5 queries.
Because last_update is set before the queries run, the cache would
already look stale by the time they finish, so the callback function
would execute the MySQL queries again for each metric.
So a quick fix would be to measure the time after collecting the data
from the MySQL queries...
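
A minimal sketch of that quick fix, not the module's actual code:
run_queries is a hypothetical stand-in for the five MySQL queries, and the
query/now parameters are assumptions added only so the timing can be
exercised without a database.

```python
import time

CACHE_TTL = 15      # seconds, the same threshold the module uses
last_update = 0.0   # time at which the cache was last refreshed
stats = {}          # cached metric values


def run_queries():
    """Hypothetical placeholder for the module's MySQL queries."""
    return {"mysql_questions": 0}


def update_stats(query=run_queries, now=time.time):
    global last_update, stats
    if now() - last_update < CACHE_TTL:
        return True          # cache is still fresh, skip the queries
    stats = query()
    # Stamp the cache only after the slow part has finished, so a long
    # query run does not use up the 15 second window.
    last_update = now()
    return True
```

With the timestamp taken after the queries, a 20 second query run is
followed by a full 15 seconds during which the other callbacks read from
the cache.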

Or the module could be improved by removing the time measurement and instead
marking each metric item as it is read by the callback function.
Once all metrics have been marked as read, the callback function
collects the metrics again.
But MySQL could block on a "SHOW SLAVE STATUS", which would then break my
design by preventing gmond from running callbacks for the other modules...
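
The mark-as-read idea could look roughly like this. ReadOnceCache and its
collect argument are hypothetical names; collect stands for whatever
function runs the MySQL queries and returns a dict of metric values.

```python
class ReadOnceCache(object):
    """Collect again only after every metric has been read once."""

    def __init__(self, names, collect):
        self.names = set(names)
        self.collect = collect   # callable returning {metric name: value}
        self.values = {}
        self.unread = set()      # empty means it is time to collect again

    def get(self, name):
        if not self.unread:
            # Every metric from the previous round has been read (or this
            # is the first call), so collect a fresh set of values.
            self.values = self.collect()
            self.unread = set(self.names)
        self.unread.discard(name)    # mark this metric as read
        return self.values[name]
```

The collect() call still runs inside a metric callback, so a blocked query
stalls gmond's callback thread exactly as described above. Also, if gmond
never asks for one of the names, the cache is never refreshed.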

Another approach is the Python threading method used in tcpconn.py:

- spawn a worker thread that caches data for many metrics
  - worker thread holds a lock while updating the metric cache
- metric callback function acquires the lock to read metric values from the cache
  - callback function blocks if the lock is already held
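
The list above can be sketched as follows. This is an illustration of the
pattern, not a copy of tcpconn.py; CachingWorker, collect and the 15 second
default interval are assumptions.

```python
import threading


class CachingWorker(threading.Thread):
    """Worker thread that refreshes a shared metric cache periodically."""

    def __init__(self, collect, interval=15.0):
        threading.Thread.__init__(self)
        self.daemon = True               # do not keep the process alive
        self.collect = collect           # callable returning {name: value}
        self.interval = interval
        self.lock = threading.Lock()
        self.cache = {}
        self.stop_event = threading.Event()

    def refresh(self):
        fresh = self.collect()           # slow work happens outside the lock
        with self.lock:
            self.cache = fresh           # lock held only for the swap

    def run(self):
        while not self.stop_event.is_set():
            self.refresh()
            self.stop_event.wait(self.interval)

    def read(self, name, default=0):
        # Called from the metric callback; blocks only while the worker
        # is swapping in a new cache.
        with self.lock:
            return self.cache.get(name, default)

    def shutdown(self):
        self.stop_event.set()
```

Doing the collection outside the lock keeps the time a callback can block
very short, since the lock only guards the assignment of the new dict.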

In this case there would only be 2 threads competing for the lock, so I
guess it doesn't matter that the Python Lock object
(http://docs.python.org/library/threading.html) has no defined
fairness scheduling...

I sort of like this approach. The callbacks can return immediately
because they read data that is cached by a worker thread.
I guess a problem with this design might be that gmond would
schedule metric callbacks out of sync with the worker thread
collecting data.
This could cause a race condition where some metric callbacks return
old values while others return new values.

Collecting lots of metrics in a single module is a common pattern.
I'm curious to see what others think of these issues.


David Stainton

-- 
Operations Engineer Spinn3r.com
Location: San Francisco, CA
YIM: mrdavids22
Work: http://spinn3r.com
Blog: http://david415.wordpress.com

_______________________________________________
Ganglia-developers mailing list
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
