matt massie wrote:
i just checked ./gmond/metric.h and the cpu_aidle metric was gone.

i wonder if this is a result of one of my patches. for all i know, i may have removed it a while back without even thinking about it, because of the four platforms i deal with, only ONE supports it. :) honestly, i have no idea. if that change did somehow sneak into the CVS tree i apologize.
(steve knows the value of the metric hash :) )

although i would like to point out that i have never personally checked in changes to any of the ganglia projects because of the firewall configuration here at the company.

so if it *was* "my" fault, it's matt's fault because i send him diffs. :) i am usually pretty careful and go through the diffs by hand to check out the changes i am submitting, and i don't remember seeing anything like this in there.

steve, i think this error is a big contributor to the problems that you
are seeing on solaris.  if the REPORTED, TN or TMAX flags are not correct
then gmetad will not work at all.  heartbeat messages are critical for
making it work.

hmmm, see other e-mails ... i have kind of hacked around this by forcing my 2.5.0 linux homebrew to update timestamps for a host every time it receives a metric. this ensures a low TN.

gmetad will not write any data that is old to the round-robin databases: data from either a dead host or a stale metric. that is why you are seeing gaps in the graphs.
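for anyone following along, here is a rough sketch of the TN check and the "touch the host on every metric" hack described above. it is not gmetad's actual code: host_t, host_touch(), host_is_stale() and the STALE_FACTOR multiplier of 4 are all assumptions made up for illustration.

```c
#include <assert.h>
#include <time.h>

/* hypothetical per-host record; field names are illustrative only */
typedef struct {
    time_t   last_heard; /* when anything was last received from this host */
    unsigned tmax;       /* max seconds expected between reports (TMAX) */
} host_t;

/* assumed rule: data is "old" once TN exceeds STALE_FACTOR * TMAX */
#define STALE_FACTOR 4

/* TN: seconds since the host was last heard from */
static unsigned host_tn(const host_t *h, time_t now)
{
    assert(h != NULL);
    return now > h->last_heard ? (unsigned)(now - h->last_heard) : 0;
}

/* the workaround: refresh the timestamp whenever ANY metric arrives,
 * not only on heartbeat messages, so TN stays low */
static void host_touch(host_t *h, time_t now)
{
    assert(h != NULL);
    h->last_heard = now;
}

/* nonzero if the host's data should be skipped rather than written */
static int host_is_stale(const host_t *h, time_t now)
{
    return host_tn(h, now) > STALE_FACTOR * h->tmax;
}
```

the idea is that calling host_touch() from the metric-receive path keeps host_is_stale() false for as long as the host is sending anything at all, heartbeat or not.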

actually, there are a couple things causing gappiness.

* first, the possibility of the source being marked dead. this is no longer happening.
* second, the parsing logic apparently causes the entire parser to bomb out if it encounters a single stale metric (maybe i should go double-check that, but off the top of my head this sounds right).
* third, the possibility of an error writing to the RRD file exists. i was getting numerous odd error messages earlier and had put in a retry mechanism to address the problem. although this did address the symptom, it did not fix the cause. the cause turned out to be that val was being assigned to argv[2] before it was formatted with snprintf(). since i moved the snprintf() line above argv[2]'s assignment in summary_RRD_update() and RRD_update() in rrd_helpers.c, the number of weird rrd_update() errors has dropped considerably.

i am still running into collisions where two threads will invoke rrd_update() at the same time. this is apparently a problem, as thread #1 will get what may be a valid error back, and thread #2 will read in the same error string. so i get stuff like this:

RRD_update(): error "expected 1 data source readings (got 0) from /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd:..." updating /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd, retrying in 1 second... value: OFF val: N:OFF arg: ,/www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd,N:OFF.

RRD_update(): error "expected 1 data source readings (got 0) from /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd:..." updating /www/gmetad/rrds/CLUSTER2/HOST2/cpu_user.rrd, retrying in 1 second... value: 0.1 val: N:0.1 arg: ,/www/gmetad/rrds/CLUSTER2/HOST2/cpu_user.rrd,N:0.1.
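one way the two fixes discussed above could look, as a minimal sketch only: format val first, then point argv[2] at it, and hold a mutex across both the update call and the capture of the error text so a second thread can't pick up the first thread's error. this is not the code in rrd_helpers.c. locked_update(), update_fn and fake_update() are hypothetical names, and fake_update() is just a stand-in for the real update call so the example builds without librrd.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* hypothetical signature for "something that behaves like an rrd update" */
typedef int (*update_fn)(int argc, char **argv);

/* one lock shared by every thread that updates an RRD */
static pthread_mutex_t rrd_lock = PTHREAD_MUTEX_INITIALIZER;

/* stand-in updater for illustration: rejects an empty "N:" reading */
static char last_arg[128];
static int fake_update(int argc, char **argv)
{
    if (argc != 3 || strlen(argv[2]) <= 2)
        return -1;
    snprintf(last_arg, sizeof last_arg, "%s", argv[2]);
    return 0;
}

/* returns the updater's result; on failure err holds a per-call message */
static int locked_update(update_fn update, const char *rrd, const char *value,
                         char *err, size_t errlen)
{
    static char prog[] = "update";
    char val[128];
    char *argv[3];
    int rc;

    assert(update && rrd && value && err && errlen > 0);

    /* format the reading FIRST ... */
    snprintf(val, sizeof val, "N:%s", value);

    /* ... and only then hand it to argv[2] */
    argv[0] = prog;
    argv[1] = (char *)rrd;
    argv[2] = val;

    /* serialize the call and capture the error before unlocking, so no
     * other thread can read or overwrite it in between */
    pthread_mutex_lock(&rrd_lock);
    rc = update(3, argv);
    if (rc != 0)
        snprintf(err, errlen, "update failed for %s (%s)", rrd, val);
    else
        err[0] = '\0';
    pthread_mutex_unlock(&rrd_lock);

    return rc;
}
```

with a real library whose error text lives in shared state, the copy into err would have to happen inside the locked region for the same reason. the cost is that updates no longer run in parallel, which may or may not matter for the number of hosts involved.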

at this moment, i get:

*  no "dead" notices.
*  skip notices only for cpu_wio being a stale metric (see below).
* about 35 or so (they come in clumps, but every few minutes) "expected 1 data source readings but got 0" rrd_update() errors, both for summaries and hosts.
* segfaults periodically. linux gmond doesn't seem to be going down with it anymore, which is good. i suspect one of my debug statements may be the cause, not any "real" program code, but i haven't been able to prove this because the segfaults seem to happen at random and it leaves me no core. :(

please try the new 2.5.0 source and let me know if solaris is a little happier now.

actually, i need to set the thresholds for cpu_wio (solaris-only) down to the same levels as cpu_idle, etc. ... for some reason it's at, like, once an hour max. and that ain't cool.

solaris gmond has been running OK for me. it's solaris gmetad which i am locked in mortal combat with (that's combat with a "c", kids).

i really want 2.5.0 out the door soon. 2.5.0 is very solid on linux right now and i think it's near time to release it.

i think it is a little optimistic to expect the very first C version of gmetad to be working on all platforms in its initial release. i suggest releasing 2.5.0 now, pretty much the way it is, with a couple disclaimers in the README ... notably, "this gmetad is new - if you don't like the way it works, use the old one." er, the old one *will* still work with 2.5.0, right?

for this reason you may want to make several versions of the front-end available too...

anyway, it would get the software into people's hands. maybe some of 'em will find (and fix) bugs, too. :P

(i have my own selfish reasons for pushing for a release - i'd rather install 2.5.0 on new systems here than add 'em to the upgrade queue :Q )

