matt massie wrote:
steve-
your idea is right on. i'm actually embarrassed that i didn't think of
that before.
man... if you only have a 128-node cluster w/30 standard metrics.. that
means that every 15 seconds you are making 3840 gettimeofday() system
calls. ouch!
there is also another more subtle problem. the cluster timestamp in the
XML was added to handle clusters scattered over different timezones. your
method is right on because it will ensure that the timeline for metrics
matches the real time on the remote clusters (and not the time of the
machine collecting the cluster data).
also... your method ensures the data values match exactly the time they
are valid... the old method is VERY sensitive to the timing of parsing the
XML and saving to the databases.
we need to wrap that change up into a release and get that out to the
users soon. actually, federico wrapped up a 2.5.2 release and i dropped
the ball on getting it the rest of the way out the door. i have been
swamped with meetings, conferences and working on ganglia 3 lately... i
think it might be smart if i focus back on getting ganglia 2.5.x cleaned
up and ready for a good maintenance release.
do you have a patch for the changes against the latest CVS source? how
has your fix been working for you lately?
I'd been meaning to do a follow-up on this. I'm having a problem with RRDs
not being written to over time.
At first I thought, "Oh, crap. I've tampered with rrd_tools.c and this is
clearly a case of divine retribution."
So I swapped out gmetad binaries, and the same damn thing started
happening, only this time I had real gappy data for about 40-45 minutes and
then it stopped entirely (while gmetad still runs).
What I should have done was check the timestamps to see whether the data
source threads were dying off entirely or whether it was just a problem
updating RRDs, but I didn't do that because it would have made sense and
been beneficial from a troubleshooting standpoint, and who needs that?
Other people seem to have had similar problems with various versions of
gmetad, if memory serves me correctly (*bites yellow bell
pepper*). So this may be an unrelated bug. Plus, I just changed a few
things from a filesystem point of view about the RRD files and am now
encountering such weirdness as having one entire data source (a two-node
test cluster, whoopty-doo) with blank data. Yet the RRD files'
last-modified timestamps are being updated every minute...
Running gmetad with debug output on for an hour with this many
hosts/metrics isn't really an option. That'd be a fricken' huge logfile.
It's worth noting that my production gmetad is 2.5.0, and my
timestamp-modified version is off the 2.5.2 release codebase. Next step if
this keeps up is to put together a patch and apply it to 2.5.0 gmetad and
see if it exhibits the same behavior...
Although the only real difference in gmetad from 2.5.0 to 2.5.2 is the
inclusion of Fed's grid logic, if I'm not mistaken (Changelog? what's
that?). So I don't know if any of the recent gmetad modifications could
have introduced such wackiness. It's probably just a condition that's
starting to pop up in running gmetad with this many sources, hosts and
metrics on this system...
Bleh.