steve has been on this problem since the beginning without knowing it.  
the functions in librrd.a are not reentrant because they rely on the 
getopt library and getopt globals which are shared between threads. 

i'm amazed that i didn't see this problem before.  i've been testing
gmetad on my personal machine (monitoring about 45 hosts in 3 clusters)
and on the planet lab guys' site (http://www.planet-lab.org/ganglia.beta/),
which has about 84 hosts in 34 clusters.

btw, the planet lab group is stressing gmetad in new, way cool ways.  they
are linking clusters all over the world (right now the U.S., Italy and
the U.K.).

both test sites worked.. i guess because the random sleeps in the threads
keep them from contending too often.

HOWEVER
when federico at SDSC (who is monitoring 461 hosts in 13 clusters) tried
the new gmetad it blew chunks.  i think that steve likely has a whole mess
of machines too, which is why he had the problem.

lesson learned: in the future try new versions of gmetad in high-stress 
environments.

i've updated the CVS to include a mutex around the rrd functions which 
fixes the problem.  i'm hoping the contention for this mutex will be low 
given that each thread is working via its own random clock.
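
for anyone curious what that looks like, here is a rough sketch of the
pattern, not the actual gmetad change.  fake_rrd_call is a stand-in for a
non-reentrant library call, and rrd_mutex, locked_rrd_call and run_workers
are made-up names.  every thread goes through one wrapper that holds a
single global mutex for the duration of the call.

```c
#include <pthread.h>

/* stand-in for a non-reentrant library call: it does an unprotected
 * read-modify-write on shared global state, like getopt's globals. */
static long shared_state = 0;

static void fake_rrd_call(void)
{
    long tmp = shared_state;
    tmp++;
    shared_state = tmp;
}

/* one global mutex serializes every call into the unsafe code */
static pthread_mutex_t rrd_mutex = PTHREAD_MUTEX_INITIALIZER;

static void locked_rrd_call(void)
{
    pthread_mutex_lock(&rrd_mutex);
    fake_rrd_call();
    pthread_mutex_unlock(&rrd_mutex);
}

static void *worker(void *arg)
{
    int calls = *(int *)arg;
    int i;

    for (i = 0; i < calls; i++)
        locked_rrd_call();
    return NULL;
}

/* start nthreads workers (capped at 64), each making calls_each locked
 * calls, wait for them all, and return the final shared counter. */
long run_workers(int nthreads, int calls_each)
{
    pthread_t tids[64];
    int i, started = 0;

    if (nthreads > 64)
        nthreads = 64;
    shared_state = 0;

    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, worker, &calls_each) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return shared_state;
}
```

with the lock in place, run_workers(4, 10000) comes back as 40000 every
time.  the cost is that only one thread can be inside the wrapped call at
once, which is why low contention matters.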

if you want the latest distribution tarball, visit
http://matt-massie.com/download/

try it out and let me know what you find.
-matt

