matt massie wrote:
steve-

your idea is right on. i'm actually embarrassed that i didn't think of that before.
man... if you only have a 128-node cluster w/ 30 standard metrics... that means that every 15 seconds you are making 3,840 gettimeofday() system calls. ouch!

there is also another, more subtle problem. the cluster timestamp in the XML was added to handle clusters scattered over different timezones. your method is right on because it ensures that the timeline for metrics matches the real time on the remote clusters (and not the time of the machine collecting the cluster data).

also... your method ensures the data values match exactly the time at which they are valid... the old method is VERY sensitive to the time of parsing the XML and saving to the databases.

we need to wrap that change up into a release and get it out to the users soon. actually, federico wrapped up a 2.5.2 release and i dropped the ball on getting it the rest of the way out the door. i have been swamped with meetings, conferences and working on ganglia 3 lately... i think it might be smart if i focus back on getting ganglia 2.5.x cleaned up and ready for a good maintenance release. do you have a patch for the changes against the latest CVS source? how has your fix been working for you lately?

I'd been meaning to do a follow-up on this. I'm having a problem with RRDs not being written to over time.

At first I thought, "Oh, crap. I've tampered with rrd_tools.c and this is clearly a case of divine retribution."

So I swapped out gmetad binaries, and the same damn thing started happening, only this time I had really gappy data for about 40-45 minutes and then it stopped entirely (while gmetad kept running).

What I should have done was check the timestamps to see whether the data-source threads were dying off entirely or whether it was just a problem updating RRDs. But I didn't do that, because it would have made sense and been beneficial from a troubleshooting standpoint, and who needs that?

Other people seem to have had similar problems with various versions of gmetad, if memory serves me correctly (*bites yellow bell pepper*). So this may be an unrelated bug. Plus, I just changed a few things about the RRD files from a filesystem point of view, and I'm now encountering such weirdness as one entire data source (a two-node test cluster, whoopty-doo) having blank data. Yet the RRD files' last-modified timestamps are being updated every minute...

Running gmetad with debug output on for an hour with this many hosts/metrics isn't really an option. That'd be a fricken' huge logfile.

It's worth noting that my production gmetad is 2.5.0, and my timestamp-modified version is off the 2.5.2 release codebase. The next step, if this keeps up, is to put together a patch, apply it to the 2.5.0 gmetad, and see if it exhibits the same behavior...

Although the only real difference in gmetad from 2.5.0 to 2.5.2 is the inclusion of Fed's grid logic, if I'm not mistaken (Changelog? what's that?). So I don't know whether any of the recent gmetad modifications could have introduced such wackiness. It's probably just a condition that's starting to pop up when running gmetad with this many sources, hosts and metrics on this system...

Bleh.

