So, after trimming out a few data_source lines from gmetad.conf, I cranked up my new 2.5.2-modified version again to see how long it lasted.

And the answer is ... about 45 minutes. (42, actually ... now you know what the question is!)

I noticed similar stalling with 2.5.0, so I'm not sold on the idea of this being specifically related to the Ganglia RRD functions. It's possible that it could be some strange high-volume bug in my copy of librrd (which hasn't been updated in the last few months - it's at 1.0.33).

It seems very odd that it always lasts about the same amount of time, no matter when I start it. In other words, I'm pretty sure there's not a rogue cron job out there making trouble.

I've run the old version with debug output and noticed that there were some errors about corrupted RRD files. It looks like I had whole directories full of zero-length RRDs. Oops, that could have been my fault. I just went through and cleared them out. This is the 2.5.0 version, and the RRD updating seems to have visibly slowed down. Two threads appear to be trying to update their own RRD sets at the same time (contending for that RRD library call lock, I'm sure). It's running in fits and starts.
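For anyone who wants to check for the same zero-length RRD problem, here is a rough sketch of how the empty files could be listed before clearing them out. It is not what I actually ran; the function name find_empty_rrds and the root directory argument are made up for illustration, and it only reports files, it does not delete anything.

```python
import os


def find_empty_rrds(root):
    """Return a sorted list of zero-length *.rrd files found under root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(empty)
```

Point it at wherever your gmetad keeps its RRDs and look over the list before removing anything.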

So, after about 25 minutes of running that, I decided to write it off, and now I'm running the new N-less version instead. Wow. It's noticeably faster on debug output.

I'm noticing that it keeps trying to create $BIG_CLUSTER/$SOME_HOST/swap_total.rrd. Which seems a little funny. How could all of these hosts *not* have the same RRD? I'm thinking that's a umask/permissions problem.
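To test the umask/permissions theory, something like the sketch below might help. It is only a guess at a useful diagnostic, not anything from gmetad itself: mode_after_umask and dir_modes are hypothetical helper names, and the idea is simply to see what mode a new file would end up with under a given umask and what permissions the per-host directories currently have.

```python
import os
import stat


def mode_after_umask(requested, umask):
    """Permission bits a new file gets when created with `requested` under `umask`."""
    return requested & ~umask


def dir_modes(root):
    """Map each directory under root (relative path) to its octal permission string."""
    modes = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        modes[os.path.relpath(dirpath, root)] = oct(mode)
    return modes
```

For example, mode_after_umask(0o666, 0o022) gives 0o644. If a host directory turns out not to be writable by the user gmetad runs as, that would fit the repeated create attempts.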

And hmmm... the summary appears to now be broken. Well, it keeps getting bigger and bigger ... oops?

Clearly I've got some work to do.  Just wanted to let you all know what was up.

