So, after trimming out a few data_source lines from gmetad.conf, I cranked
up my new 2.5.2-modified version again to see how long it lasted.
And the answer is ... about 45 minutes. (42, actually ... now you know
what the question is!)
I noticed similar stalling with 2.5.0, so I'm not sold on the idea of this
being specifically related to the Ganglia RRD functions. It's possible
that it could be some strange high-volume bug in my copy of librrd (which
hasn't been updated in the last few months - it's at 1.0.33).
It seems very odd that it always lasts about the same amount of time, no
matter when I start it. In other words, I'm pretty sure there's not a
rogue cron job out there making trouble.
I've run the old version with debug output and noticed that there were some
errors about corrupted RRD files. It looks like I had whole directories
full of zero-length RRDs. Oops, that could have been my fault. I just
went through and cleared them out. This is with the 2.5.0 version, and the
RRD updating seems to have visibly slowed down. Two threads appear to be
trying to update their own RRD sets at the same time (contending for that
RRD library call lock, I'm sure). It's running in fits and starts.
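
For anyone who wants to check their own tree for the zero-length RRDs
mentioned above, here is a minimal sketch in Python. It is not what I
actually ran; the function name find_empty_rrds is made up for
illustration, and the root directory is whatever you pass on the command
line (it assumes nothing about where gmetad keeps its RRDs). It only lists
the files and does not delete anything.

```python
import os
import sys


def find_empty_rrds(root):
    """Return a sorted list of zero-length *.rrd files found under root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # The file vanished or could not be examined; skip it.
                continue
    return sorted(empty)


if __name__ == "__main__":
    # The root is an assumption: pass your own RRD directory as argument 1.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_empty_rrds(root):
        print(path)
```

Listing first and removing by hand afterwards seems safer than deleting in
the same pass, since an empty file can also be one that is just being
created.
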
So, after about 25 minutes of running that, I decided to write it off, and
now I'm running the new N-less version instead. Wow. It's noticeably
faster, judging by the debug output.
I'm noticing that it keeps trying to create
$BIG_CLUSTER/$SOME_HOST/swap_total.rrd. Which seems a little funny. How
could all of these hosts *not* have the same RRD? I'm thinking that's a
umask/permissions problem.
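
A quick way to test the umask/permissions theory is sketched below in
Python. This is only an illustration under my own assumptions: the helper
names (effective_mode, current_umask, can_create_in) are hypothetical, the
0666 requested mode is a guess at what a file-creating call would ask for,
and the directory to check is whatever you pass in. It has to be run as the
same user as gmetad to mean anything.

```python
import os
import sys


def effective_mode(requested_mode, umask):
    """Permission bits a new file ends up with after the umask is applied."""
    return requested_mode & ~umask


def current_umask():
    """Read the process umask (it can only be read by setting it briefly)."""
    old = os.umask(0)
    os.umask(old)
    return old


def can_create_in(directory):
    """True if this user can create files in directory (write + search)."""
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # The directory is an assumption: pass one host's RRD directory here.
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    mask = current_umask()
    print("umask: %04o" % mask)
    print("new 0666 file would get: %04o" % effective_mode(0o666, mask))
    print("can create in %s: %s" % (directory, can_create_in(directory)))
```

If the directory is not writable for that user, every attempt to create
swap_total.rrd would fail and be retried on the next pass, which would match
what I'm seeing.
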
And hmmm... the summary appears to now be broken. Well, it keeps getting
bigger and bigger ... oops?
Clearly I've got some work to do. Just wanted to let you all know what was up.