steve has been on this problem since the beginning without knowing it. the functions in librrd.a are not reentrant because they rely on the getopt library and getopt's globals, which are shared between threads.
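
one common way to make non-reentrant calls like these safe is to serialize them behind a single mutex. here is a minimal sketch of that idea, not the actual gmetad code. everything in it is hypothetical: `fake_rrd_call` stands in for a librrd function, `shared_state` stands in for a getopt-style global, and `locked_rrd_call`, `worker` and `run_workers` are names made up for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS      8
#define CALLS_PER_THREAD 10000
#define FAKE_ARGC        3

/* hypothetical stand-in for a getopt-style global: one copy,
   shared by every thread in the process */
static int shared_state = 0;

/* hypothetical stand-in for a non-reentrant librrd function.
   it does an unsynchronized read-modify-write on the shared global,
   so concurrent callers can lose updates. */
static int fake_rrd_call(int argc, char **argv)
{
    int i;
    (void)argv;
    for (i = 0; i < argc; i++)
        shared_state++;
    return 0;
}

/* one process-wide mutex serializes every call into the library */
static pthread_mutex_t rrd_mutex = PTHREAD_MUTEX_INITIALIZER;

static int locked_rrd_call(int argc, char **argv)
{
    int err, rval;

    err = pthread_mutex_lock(&rrd_mutex);
    assert(err == 0);
    rval = fake_rrd_call(argc, argv);
    err = pthread_mutex_unlock(&rrd_mutex);
    assert(err == 0);
    return rval;
}

/* each worker plays the role of one data-source thread */
static void *worker(void *arg)
{
    char a0[] = "update", a1[] = "example.rrd", a2[] = "N:1";
    char *argv[FAKE_ARGC] = { a0, a1, a2 };
    int i;

    (void)arg;
    for (i = 0; i < CALLS_PER_THREAD; i++)
        locked_rrd_call(FAKE_ARGC, argv);
    return NULL;
}

/* start the workers, wait for them, and return the final counter */
int run_workers(void)
{
    pthread_t tids[NUM_THREADS];
    int i, err;

    shared_state = 0;
    for (i = 0; i < NUM_THREADS; i++) {
        err = pthread_create(&tids[i], NULL, worker, NULL);
        assert(err == 0);
    }
    for (i = 0; i < NUM_THREADS; i++) {
        err = pthread_join(tids[i], NULL);
        assert(err == 0);
    }
    return shared_state;
}
```

with the lock in place, `run_workers()` always ends at `NUM_THREADS * CALLS_PER_THREAD * FAKE_ARGC`. if `worker` called `fake_rrd_call` directly, the threads could step on each other and the total could come up short. the cost is that only one thread is inside the library at a time, so this works best when calls are short and threads rarely line up.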
i'm amazed that i didn't see this problem before. i've been testing gmetad on my personal machine (monitoring about 45 hosts in 3 clusters) and on the planet lab guys' site (http://www.planet-lab.org/ganglia.beta/), which has about 84 hosts in 34 clusters. btw, the planet lab group is stressing gmetad in way cool new ways. they are linking clusters all over the world (right now the U.S., Italy and the U.K.).

both test sites worked, i guess because the random sleeps in the threads keep them from contending too often. HOWEVER, when federico at SDSC (who is monitoring 461 hosts in 13 clusters) tried the new gmetad, it blew chunks. i think steve likely has a whole mess of machines too, which is why he had the problem.

lesson learned: in the future, try new versions of gmetad in high-stress environments.

i've updated the CVS to include a mutex around the rrd functions, which fixes the problem. i'm hoping the contention for this mutex will be low given that each thread is working off its own random clock.

if you want the latest distribution tarball, visit http://matt-massie.com/download/

try it out and let me know what you find.

-matt
