I think this is a well-thought-out email, and I'm a little surprised at the
lack of response to it.  Is it because no one is actually using the gmond
python module interface, so no one has had to make these kinds of decisions?
I don't have a single gmetric script/module that collects only one metric, and
I have used both the multi-threaded collector approach and the single-threaded
hopeful/naive approach (having written the mysql metric module you mention).
I agree that the multi-threaded design seems less prone to problems, but in
practice I haven't had any problems either way.  That said, I will be trying
some of your suggestions or moving to threads for that mysql module when time
permits.

My frustration with the gmond python module interface is that it's not actually
a complete replacement for gmetric scripts as I use them.  Needing to know
every metric a module will report before runtime means a lot of upfront work
creating .pyconf files, and new metrics can't be added without restarting
gmond.  Being able to deploy one gmetric script that conditionally reports
metrics based on what's running, what hardware is installed, etc. on a box is
a big advantage over conditionally generating .pyconf files and then still
having to conditionally collect metrics in the actual gmond module.  I expect
most users just stick with the gmetric script in this case and handle
scheduling themselves?
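
To make the gmetric-script side of that comparison concrete, here is a
rough sketch of the "one script, conditional metrics" pattern.  The
collector table, probe functions, and metric names are hypothetical, and
it assumes the gmetric binary is on the PATH and is called with its
--name/--value/--type/--units options; scheduling (cron or similar) is
left to the caller.

```python
import subprocess


def build_gmetric_commands(collectors):
    """Return one gmetric argv per metric whose probe succeeds on this host.

    `collectors` maps a metric name to (probe, collect, units), where
    probe() says whether the metric applies to this box and collect()
    returns its current integer value.  All three are hypothetical helpers.
    """
    commands = []
    for name, (probe, collect, units) in sorted(collectors.items()):
        if not probe():  # metric does not apply to this box, skip it
            continue
        commands.append([
            "gmetric",
            "--name", name,
            "--value", str(collect()),
            "--type", "uint32",
            "--units", units,
        ])
    return commands


def report(collectors, run=subprocess.check_call):
    """Send every applicable metric; `run` is injectable for dry runs."""
    for argv in build_gmetric_commands(collectors):
        run(argv)
```

Passing run=print gives a dry run that shows what would be sent without
invoking gmetric at all.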

-g



----- Original Message ----
> From: David Stainton <[email protected]>
> To: [email protected]
> Sent: Friday, January 23, 2009 11:43:32 AM
> Subject: [Ganglia-developers] gmond python module interface
> 
> Hi,
> 
> I've been thinking about the python module interface and how best to use it.
> Gmond uses a single thread that executes the callback function for
> every metric of every module in a scheduled fashion...
> This seems like a brittle design that won't scale for many metrics.
> If a developer writes a module whose callback takes too long, that
> prevents other metric callbacks from being called.
> I was thinking the design should use threads to prevent a bad module
> from DOSing the rest of the modules.
> Either that or an enforced timer...
> 
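As an illustration of the "enforced timer" idea only: gmond's scheduler
is C code, so the sketch below is not how gmond works or could be
patched.  It just shows the shape of the technique in Python, with a
hypothetical call_with_timeout() helper built on concurrent.futures.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def call_with_timeout(executor, callback, name, timeout, fallback=None):
    """Run callback(name) on the pool, giving up after `timeout` seconds.

    The slow callback is not killed; it keeps running on its worker
    thread.  The caller simply stops waiting and gets `fallback` back.
    """
    future = executor.submit(callback, name)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return fallback
```

The caveat in the docstring matters: a timer like this protects the
scheduler, but a module that hangs forever still ties up a worker.
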
> I maintain a largish cluster of mysql databases so I wanted to use Ganglia
> to get mysql stats. I found a python module for doing this :
> 
> http://g.raphaelli.com/2009/1/5/ganglia-mysql-metrics
> 
> This module provides a lot of useful mysql metrics from the output of
> about 5 mysql queries.
> I briefly audited the code and found interesting things.
> The callback function first calls update_stats() before returning the
> relevant metric.
> Obviously we want update_stats() to cache data and perform the mysql
> queries again only after all the metric callbacks have read a metric
> from the cache.
> However, I noticed this at the beginning of update_stats():
> 
>     if time.time() - last_update < 15:
>         return True
>     else:
>         last_update = time.time()
> 
> This design assumes that the gmond metric scheduler will schedule all
> mysql metric callbacks within 15 seconds of the first mysql metric
> callback.
> This is probably a safe assumption but it still bothers me ;-)
> 
> I think gmond is unlikely to take longer than 15 seconds to call all
> the mysql metric callbacks.
> But a mysql database could easily take longer than 15 seconds to
> return results for the 5 queries.
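> Since last_update is set before the queries run, the cache would
> already look stale at the next callback, so the callback function
> would execute the mysql queries for each metric.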
> So a quick fix would be to measure the time after collecting the data
> from the mysql queries...
> 
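A minimal sketch of that quick fix, assuming a hypothetical StatsCache
wrapper rather than the real module's globals.  The fetch callable stands
in for the 5 mysql queries, and the clock parameter exists only so the
behaviour can be checked without waiting.

```python
import time


class StatsCache:
    """Cache a batch of stats and refresh it at most every max_age seconds."""

    def __init__(self, fetch, max_age=15, clock=time.time):
        self.fetch = fetch          # hypothetical: runs the slow queries
        self.max_age = max_age
        self.clock = clock
        self.last_update = None
        self.stats = {}

    def update_stats(self):
        now = self.clock()
        if self.last_update is not None and now - self.last_update < self.max_age:
            return True             # cache is still fresh
        self.stats = self.fetch()
        # Stamp the cache after the slow part, so a 20 second query run
        # does not make the fresh data look 20 seconds old.
        self.last_update = self.clock()
        return True

    def get(self, name):
        self.update_stats()
        return self.stats[name]
```

With the timestamp taken before fetch() instead, a fetch slower than
max_age would trigger a new fetch on every single callback.
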
> Or the module could be improved by removing the time measurement and
> instead marking each metric item as it is read by the callback function.
> When all metrics are finally marked as read, the callback function
> will compute the metrics again.
> Mysql could block on a "SHOW SLAVE STATUS" which would then break my design
> by preventing gmond from running other callbacks for the other modules...
> 
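The mark-as-read idea could look roughly like this.  ReadOnceCache and
its fetch callable are hypothetical names; as noted above, fetch() still
runs inside the callback, so a blocking query still stalls the scheduler.

```python
class ReadOnceCache:
    """Refresh the cached batch only once every metric has been read."""

    def __init__(self, fetch, names):
        self.fetch = fetch          # hypothetical: runs the slow queries
        self.names = set(names)
        self.unread = set()         # empty set forces a fetch on first read
        self.stats = {}

    def get(self, name):
        if not self.unread:
            # Every metric from the previous batch has been handed out.
            self.stats = self.fetch()
            self.unread = set(self.names)
        self.unread.discard(name)   # mark this metric as read
        return self.stats[name]
```

One consequence of this design: if the scheduler stops calling one of
the metrics, the batch is never refreshed for the others either.
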
> Another approach is the python threading method used in tcpconn.py:
> 
> - spawn a worker thread that caches data for many metrics
>   - worker thread uses a lock when updating the metric cache
> - metric callback function acquires lock to read metric values from cache
>   - callback function blocks if lock is already acquired
> 
> In this case there would only be 2 threads competing for the lock, so I
> guess it doesn't matter that the python Lock object
> (http://docs.python.org/library/threading.html) has no defined
> fairness scheduling...
> 
> I sort of like this approach. The callbacks can return immediately
> because they read data that is cached by a worker thread.
> I guess a problem with this design might be that gmond would
> schedule metric callbacks out of sync with the worker thread
> collecting data.
> This could cause a race condition where some metric callbacks return
> old values while others return new values.
> 
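A rough sketch of the worker-thread pattern described above.  This is
not the tcpconn.py code; CachedCollector and fetch are hypothetical
names, and the real module may be structured differently.

```python
import threading


class CachedCollector:
    """Background thread refreshes a cache; callbacks only read it."""

    def __init__(self, fetch, interval=15.0):
        self.fetch = fetch              # hypothetical: runs the slow queries
        self.interval = interval
        self.lock = threading.Lock()
        self.stats = {}
        self.stop_event = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()

    def _run(self):
        while not self.stop_event.is_set():
            fresh = self.fetch()        # slow work happens outside the lock
            with self.lock:
                self.stats = fresh      # swap in the whole batch at once
            self.stop_event.wait(self.interval)

    def get(self, name, default=0):
        # Called from the metric callback; holds the lock only briefly.
        with self.lock:
            return self.stats.get(name, default)

    def stop(self):
        self.stop_event.set()
        self.thread.join()
```

Because the lock is held only for the swap and for a single lookup, a
callback never waits on a query.  Two callbacks that straddle a swap can
still return values from different batches, which is the old/new mix
described above; the swap only guarantees each batch is internally
consistent.
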
> Collecting lots of metrics in a single module is a common pattern.
> I'm curious to see what others think of these issues.
> 
> 
> David Stainton
> 
> -- 
> Operations Engineer Spinn3r.com
> Location: San Francisco, CA
> YIM: mrdavids22
> Work: http://spinn3r.com
> Blog: http://david415.wordpress.com
> 
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by:
> SourcForge Community
> SourceForge wants to tell your story.
> http://p.sf.net/sfu/sf-spreadtheword
> _______________________________________________
> Ganglia-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-developers
