Hi,

I've been thinking about the Python module interface and how best to use it. Gmond uses a single thread that executes the callback function for every metric of every module on a schedule. This seems like a brittle design that won't scale to many metrics: if a developer writes a module whose callback takes too long, it prevents the other metric callbacks from being called. I think the design should use threads to stop a bad module from DoSing the rest of the modules. Either that, or an enforced timer.

I maintain a largish cluster of MySQL databases, so I wanted to use Ganglia to get MySQL stats. I found a Python module for doing this:

http://g.raphaelli.com/2009/1/5/ganglia-mysql-metrics

This module provides a lot of useful MySQL metrics from the output of about 5 MySQL queries. I briefly audited the code and found some interesting things. The callback function first calls update_stats() before returning the relevant metric. Obviously we want update_stats() to cache the data, and to run the MySQL queries again only after all the metric callbacks have read their metric from the cache. However, I noticed this at the beginning of update_stats():

    if time.time() - last_update < 15:
        return True
    else:
        last_update = time.time()

This design assumes that the gmond metric scheduler will run all the MySQL metric callbacks within 15 seconds of the first one. This is probably a safe assumption, but it still bothers me ;-)

I think gmond is unlikely to take longer than 15 seconds to call all the MySQL metric callbacks. But a MySQL database could easily take longer than 15 seconds to return results for the 5 queries. Because last_update is set before the queries run, the cache would already look stale by the time they finish, and the callback function would execute the MySQL queries again for each metric.

So a quick fix would be to take the timestamp after collecting the data from the MySQL queries. Or the module could be improved by removing the time measurement and instead marking each metric item as it is read by the callback function. Once all the metrics have been marked as read, the callback function computes the metrics again.

Either way, MySQL could block on a "SHOW SLAVE STATUS", which would break my design by preventing gmond from running the callbacks of the other modules.
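
A minimal sketch of the quick fix mentioned above, taking the timestamp only after the queries have returned. This is not the linked module's code: run_queries, MAX_AGE and metric_callback are placeholder names, and the fetch/now parameters are there only so the timing can be exercised without a database.

```python
import time

MAX_AGE = 15  # seconds; same threshold as the check quoted above

_cache = {}
_last_update = 0.0


def run_queries():
    """Hypothetical stand-in for the ~5 MySQL queries; returns a dict of metrics."""
    return {"mysql_questions": 42, "mysql_threads_running": 3}


def update_stats(fetch=run_queries, now=time.time):
    """Refresh the cache if it is older than MAX_AGE seconds."""
    global _cache, _last_update
    if now() - _last_update < MAX_AGE:
        return True  # cache is still fresh, skip the queries
    _cache = fetch()
    # Stamp the cache *after* the queries return, so a slow server
    # does not make the cache look stale the moment it is filled.
    _last_update = now()
    return True


def metric_callback(name):
    """Callback in the style gmond uses: takes a metric name, returns its value."""
    update_stats()
    return _cache[name]
```

With this ordering, queries that take 20 seconds still leave a cache that is considered fresh for the following 15 seconds, so the remaining callbacks read from it instead of querying again. It does nothing about the blocking problem: a stuck query still stalls the callback that triggered it.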

Another approach is the Python threading method used in tcpconn.py:

- spawn a worker thread that caches data for many metrics
- the worker thread holds a lock while it updates the metric cache
- the metric callback function acquires the lock to read metric values from the cache
- the callback function blocks if the lock is already held

In this case there would only be 2 threads competing for the lock, so I guess it doesn't matter that the Python Lock object (http://docs.python.org/library/threading.html) has no defined fairness scheduling.

I sort of like this approach. The callbacks can return immediately because they read data that is cached by a worker thread. I guess a problem with this design might be that gmond would schedule metric callbacks out of sync with the worker thread collecting the data. This could cause a race condition where some metric callbacks return old values while others return new values.

Collecting lots of metrics in a single module is a common pattern. I'm curious to see what others think of these issues.

David Stainton

-- 
Operations Engineer
Spinn3r.com
Location: San Francisco, CA
YIM: mrdavids22
Work: http://spinn3r.com
Blog: http://david415.wordpress.com

_______________________________________________
Ganglia-developers mailing list
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
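
P.S. A rough sketch of the worker-thread pattern listed above, for illustration only. It is not the code from tcpconn.py: the MetricCache class, its method names and the collect function are all hypothetical. The one design choice it makes is to build the new values outside the lock and hold the lock only for the swap, so callbacks never wait on the slow collection step.

```python
import threading


class MetricCache:
    """Background worker that refreshes a dict of metrics; callbacks only read."""

    def __init__(self, collect, interval=15.0):
        self._collect = collect            # hypothetical function returning a dict
        self._interval = interval          # seconds between refreshes
        self._lock = threading.Lock()
        self._stopped = threading.Event()
        self._values = {}
        self._thread = threading.Thread(target=self._run, daemon=True)

    def refresh(self):
        fresh = self._collect()            # slow work happens outside the lock
        with self._lock:
            self._values = fresh           # lock is held only for the swap

    def _run(self):
        while not self._stopped.is_set():
            self.refresh()
            self._stopped.wait(self._interval)  # sleeps, but wakes early on stop()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopped.set()
        self._thread.join()

    def get(self, name, default=0):
        """What a metric callback would call; returns right away from the cache."""
        with self._lock:
            return self._values.get(name, default)
```

Swapping the whole dict at once means a single read never sees a half-updated cache, but it does not remove the out-of-sync concern: two callbacks scheduled on either side of a refresh will still report values from different collection runs.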
