Re: [Ganglia-developers] CVS checkin

Steven Wagner Thu, 12 Sep 2002 21:24:33 -0700

matt massie wrote:

i just checked ./gmond/metric.h and the cpu_aidle metric was gone.

i wonder if this is a result of one of my patches. for all i know, i may have removed it a while back without even thinking about it, because of the four platforms i deal with, only ONE supports it. :) honestly, i have no idea. if that change did somehow sneak into the CVS tree i apologize.

(steve knows the value of the metric hash :) )

although i would like to point out that i have never personally checked in changes to any of the ganglia projects because of the firewall configuration here at the company.

so if it *was* "my" fault, it's matt's fault because i send him diffs. :) i am usually pretty careful and go through the diffs by hand to check out the changes i am submitting, and i don't remember seeing anything like this in there.

steve, i think this error is a big contributor to the problems that you
are seeing on solaris.  if the REPORTED, TN or TMAX flags are not correct
then gmetad will not work at all.  heartbeat messages are critical for
making it work.

hmmm, see other e-mails ... i have kind of hacked around this by forcing my 2.5.0 linux homebrew to update timestamps for a host every time it receives a metric. this ensures a low TN.

gmetad will not write any data to the round-robin databases which is old: either a dead host or stale metric. that is why you are seeing gaps in the graphs.
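for illustration, here is a minimal sketch of the freshness logic discussed above: a host's TN resets whenever its timestamp is refreshed, and data is written only when neither the host nor the metric looks old. the struct layouts, the function names, and the "TN greater than TMAX times a slack factor" rule are assumptions made for this sketch, not code taken from the gmetad source.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical records; field names are illustrative only. */
typedef struct {
    time_t last_heard;   /* when anything was last received from this host */
} host_t;

typedef struct {
    unsigned int tn;     /* seconds since the metric was last reported */
    unsigned int tmax;   /* maximum expected seconds between reports */
} metric_t;

/* TN for a host: seconds elapsed since we last heard from it. */
unsigned int host_tn(const host_t *h, time_t now)
{
    assert(h != NULL);
    return now > h->last_heard ? (unsigned int)(now - h->last_heard) : 0;
}

/* The workaround described in the thread: refresh the host timestamp
 * on every received metric, so TN stays low. */
void host_touch(host_t *h, time_t now)
{
    assert(h != NULL);
    h->last_heard = now;
}

/* Assumed rule: a metric is stale once TN exceeds TMAX * slack. */
int metric_is_stale(const metric_t *m, unsigned int slack)
{
    assert(m != NULL);
    return m->tn > m->tmax * slack;
}

/* Write only when the host is alive and the metric is fresh.
 * host_tmax and slack are parameters assumed for this sketch. */
int should_write_rrd(const host_t *h, const metric_t *m, time_t now,
                     unsigned int host_tmax, unsigned int slack)
{
    if (host_tn(h, now) > host_tmax * slack)
        return 0;                      /* host looks dead: skip */
    return !metric_is_stale(m, slack); /* stale metric: skip */
}
```

with a check like this, any skipped write leaves a hole in the corresponding RRD, which is consistent with the gaps in the graphs.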


actually, there are a couple things causing gappiness.

* first, the possibility of the source being marked dead. this is no longer happening.
* second, the parsing logic apparently causes the entire parser to bomb out if it encounters a single stale metric (maybe i should go double-check that, but off the top of my head this sounds right).
* third, the possibility of an error writing to the RRD file exists. i was getting numerous odd error messages earlier and had put in a retry mechanism to address the problem. although this did address the symptom, it did not fix the cause. the cause turned out to be that val was being assigned to argv[2] before it was formatted with snprintf(). since i moved the snprintf() line above argv[2]'s assignment in summary_RRD_update() and RRD_update() in rrd_helpers.c, the number of weird rrd_update() errors has dropped considerably. i am still running into collisions where two threads will invoke rrd_update() at the same time. this is apparently a problem, as thread #1 will get what may be a valid error back, and thread #2 will read in the same error string. so i get stuff like this:

RRD_update(): error "expected 1 data source readings (got 0) from /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd:..." updating /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd, retrying in 1 second... value: OFF val: N:OFF arg: ,/www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd,N:OFF.

RRD_update(): error "expected 1 data source readings (got 0) from /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd:..." updating /www/gmetad/rrds/CLUSTER2/HOST2/cpu_user.rrd, retrying in 1 second... value: 0.1 val: N:0.1 arg: ,/www/gmetad/rrds/CLUSTER2/HOST2/cpu_user.rrd,N:0.1.
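a rough sketch of the two points above, assuming nothing about the real rrd_helpers.c beyond what is described: the value string is formatted with snprintf() before it is placed in argv[2], and a mutex serializes the update call so two threads cannot end up reading each other's error state. fake_rrd_update(), update_sketch(), rrd_lock and last_arg_seen are hypothetical names; the stub stands in for the real rrd_update() so the example is self-contained.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for rrd_update(): it only records the value
 * argument it was handed and reports success. */
static char last_arg_seen[64];

static int fake_rrd_update(int argc, char **argv)
{
    assert(argc == 3);
    snprintf(last_arg_seen, sizeof last_arg_seen, "%s", argv[2]);
    return 0;
}

/* One lock shared by every thread that performs updates. */
static pthread_mutex_t rrd_lock = PTHREAD_MUTEX_INITIALIZER;

int update_sketch(const char *rrd_path, const char *value)
{
    static char prog[] = "update";
    char val[64];
    char *argv[3];
    int rc;

    /* Format the value first ... */
    snprintf(val, sizeof val, "N:%s", value);

    /* ... and only then fill in the argument vector. */
    argv[0] = prog;
    argv[1] = (char *)rrd_path;
    argv[2] = val;

    /* Serialize the call. With a real library that keeps a shared
     * error string, that string would also have to be read (and
     * cleared) here, before the lock is released. */
    pthread_mutex_lock(&rrd_lock);
    rc = fake_rrd_update(3, argv);
    pthread_mutex_unlock(&rrd_lock);

    return rc;
}
```

the message does not say exactly why the original ordering misbehaved, so the sketch only shows the ordering that was reported to help. holding one lock around every update costs some parallelism, but it would keep an error raised for one file from being reported against another, as in the second log line.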


at this moment, i get:

*  no "dead" notices.
*  skip notices only for cpu_wio being a stale metric (see below).

* about 35 or so (they come in clumps, but every few minutes) "expected 1 data source readings but got 0" rrd_update() errors, both for summaries and hosts.
* segfaults periodically. linux gmond doesn't seem to be going down with it anymore, which is good. i suspect one of my debug statements may be the cause, not any "real" program code, but i haven't been able to prove this because the segfaults seem to happen at random and it leaves me no core. :(

please try the new 2.5.0 source and let me know if solaris is a little happier now.

actually, i need to set the thresholds for cpu_wio (solaris-only) down to the same levels as cpu_idle, etc. ... for some reason it's at, like, once an hour max. and that ain't cool.

solaris gmond has been running OK for me. it's solaris gmetad which i am locked in mortal combat with (that's combat with a "c", kids).

i really want 2.5.0 out the door soon. 2.5.0 is very solid on linux right now and i think it's near time to release it.

i think it is a little optimistic to expect the very first C version of gmetad to be working on all platforms in its initial release. i suggest releasing 2.5.0 now, pretty much the way it is, with a couple disclaimers in the README ... notably, "this gmetad is new - if you don't like the way it works, use the old one." er, the old one *will* still work with 2.5.0, right?

for this reason you may want to make several versions of the front-end available too...

anyway, it would get the software into people's hands. maybe some of 'em will find (and fix) bugs, too. :P

(i have my own selfish reasons for pushing for a release - i'd rather install 2.5.0 on new systems here than add 'em to the upgrade queue :Q )
