Today, Steven Wagner wrote forth saying...

> although i would like to point out that i have never personally
> checked in changes to any of the ganglia projects because of the
> firewall configuration here at the company.
> 
> so if it *was* "my" fault, it's matt's fault because i send him diffs. :) 

heheh.  that's true.. i should have caught it (if it was you).  i don't
know who it was..  it's not really important.. i feel i might have
overreacted a bit.

> > steve, i think this error is a big contributor to the problems that you
> > are seeing on solaris.  if the REPORTED, TN or TMAX flags are not correct
> > then gmetad will not work at all.  heartbeat messages are critical for
> > making it work.
> 
> hmmm, see other e-mails ... i have kind of hacked around this by forcing my 
> 2.5.0 linux homebrew to update timestamps for a host every time it receives 
> a metric.  this ensures a low TN.

why the crazy setup?  i think your homebrew might be dangerous to your
vision.  remember: ethyl alcohol good. methyl alcohol bad. 

i've confiscated a solaris box here to test gmetad on solaris.  you can
see the page at

http://hear.millennium.berkeley.edu/gmetad-webfrontend/

hear is an old sun4u (167MHz) w/ a whopping 128MB of memory and running 
PHP 3 (so it's not working perfectly).

i think i understand some of your old rrd problems (fixed in the latest 
CVS).  

your problem is not network related (i don't think)... it's disk i/o 
related.

one problem is that the old version of gmetad was sending string metrics 
from pre-2.5.0 to disk.  bad (and now fixed).  that was causing some of 
the errors (i think) that you were seeing.  your email showing that 
gexec.rrd was getting updated with an "ON" value tipped me off to this 
problem.
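
to make the string-metric point concrete, here is a rough sketch (python, not the actual gmetad C code) of filtering values before they are handed to an rrd update.  the function name `rrd_update_arg` is made up for illustration; the only thing taken from the logs is the "N:<value>" form of the update string.

```python
import math


def rrd_update_arg(value):
    """Return an 'N:<value>' update string, or None if value is not numeric.

    Hypothetical helper: string metrics such as "ON"/"OFF" return None so
    the caller can skip the rrd update instead of failing on it.
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # NaN and infinities parse as floats but are not useful samples here.
    if not math.isfinite(number):
        return None
    return "N:" + str(value).strip()


if __name__ == "__main__":
    for raw in ("0.1", "OFF", "ON"):
        print(raw, "->", rrd_update_arg(raw))
```

the idea is only that a value like OFF never reaches the rrd layer; how the real gmetad decides which metrics are strings may well differ.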

another problem is that gmetad is very disk intensive.  i'm writing my 
rrds to /tmp on "hear" because writing to /tmp (on solaris) is writing to 
memory (it doesn't touch disk).  you can see that my data has been solid 
for quite a while now.

i bet if you install a clean CVS version of gmetad (and delete all your 
old rrds on your test machine) you'll find gmetad will work for you.  your 
other option is to start teaching yourself braille.


> > gmetad will not write any data to the round-robin databases which is old: 
> > either a dead host or stale metric.  that is why you are seeing gaps in 
> > the graphs.
> 
> actually, there are a couple things causing gappiness.
> 
> *  first, the possibility of the source being marked dead.  this is no 
> longer happening.

you've taken out a good feature.  why have gmetad write stale data to 
disk?  it's extra disk i/o you don't need, and you lose helpful info 
about which machines are up and which machines have tanked.
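
a rough sketch of the staleness check i mean (python, not the real gmetad code).  it assumes TN is the number of seconds since the last report and TMAX is the longest expected gap between reports; the `slack` multiplier and both function names are made up for illustration, and the real thresholds may differ.

```python
def is_stale(tn, tmax, slack=1):
    """True when the time since the last report (tn) exceeds tmax * slack."""
    return tn > tmax * slack


def should_write(host_tn, host_tmax, metric_tn, metric_tmax):
    """Write to the rrd only if neither the host nor the metric is stale."""
    return not is_stale(host_tn, host_tmax) and not is_stale(metric_tn, metric_tmax)


if __name__ == "__main__":
    # fresh host, fresh metric
    print(should_write(5, 20, 10, 60))
    # host has not sent a heartbeat for a long time
    print(should_write(100, 20, 10, 60))
    # host is alive but this one metric is old
    print(should_write(5, 20, 120, 60))
```

skipping the write in the last two cases is what leaves the gap in the graph, and the gap is the useful part: it shows when the host or metric went quiet.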


> *  second, the parsing logic apparently causes the entire parser to bomb 
> out if it encounters a single stale metric (maybe i should go double-check 
> that, but off the top of my head this sounds right).

can you double-check that against the latest CVS? 

> *  third, the possibility of an error writing to the RRD file exists. 

i think the latest CVS will fix this problem.  when you look below, for 
example, you'll see that gmetad is trying to write the numerical value 
OFF to gexec.rrd.  silliness.  OFF is a string.

> RRD_update(): error "expected 1 data source readings (got 0) from
> /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd:..." updating
> /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd, retrying in 1 second...
> value:  OFF val: N:OFF arg:  
> ,/www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd,N:OFF. RRD_update(): error
> "expected 1 data source readings (got 0) from
> /www/gmetad/rrds/CLUSTER1/HOST1/gexec.rrd:..." updating
> /www/gmetad/rrds/CLUSTER2/HOST2/cpu_user.rrd, retrying in 1 second...  
> value: 0.1 val: N:0.1 arg:  
> ,/www/gmetad/rrds/CLUSTER2/HOST2/cpu_user.rrd,N:0.1.
>
> i think it is a little optimistic to expect the very first C version of 
> gmetad to be working on all platforms in its initial release.  i suggest 
> releasing 2.5.0 now, pretty much the way it is, with a couple disclaimers 
> in the README ... notably, "this gmetad is new - if you don't like the way 
> it works, use the old one."  er, the old one *will* still work with 2.5.0, 
> right?

i will work hard the next day or so trying to get you happy on solaris.  
if we can't then i'll release 2.5.0 with a note.. but i really, really 
don't want to do that.  i can see from my crappy solaris web server that 
it does work.  i'd like to know when it won't.  by the way, "Matt Box" in 
the display is a 2.5.0 gmond while "unspecified" is a 2.4.1 gmond.  so 
mixing data sources doesn't seem to be the problem.

> for this reason you may want to make several versions of the front-end 
> available too...

why would we do that?

> anyway, it would get the software into people's hands.  maybe some of 'em 
> will find (and fix) bugs, too. :P
> 
> (i have my own selfish reasons for pushing for a release - i'd rather 
> install 2.5.0 on new systems here than add 'em to the upgrade queue :Q )

then let's get it out for release soon.  let me know how the CVS 2.5.0 
works for you on solaris.  otherwise.. i think we are good to go.

-matt

