Re: [Ganglia-developers] CVS checkin

Steven Wagner Fri, 13 Sep 2002 12:40:40 -0700

matt massie wrote:

Today, Steven Wagner wrote forth saying...

although i would like to point out that i have never personally
checked in changes to any of the ganglia projects because of the
firewall configuration here at the company.

so if it *was* "my" fault, it's matt's fault because i send him diffs. :)



heheh.  that's true.. i should have caught it (if it was you).  i don't
know who it was..  it's not really important.. i feel i might have
overreacted a bit.

Maybe so, Mr. Smartypants! (or, to our English friends, "Mr. CleverTrousers!")

why the crazy setup?  i think your homebrew might be dangerous to your
vision. remember: ethyl alcohol good. methyl alcohol bad.


Well, it's not a *stock* version of 2.4.1 on these nodes, so...

[I knew that the code I'd be using would fork at some point but I'm trying to keep it pretty much in line with releases ... hence anxiety over 2.5.0 release :) ]

i've confiscated a solaris box here to test gmetad on solaris.  you can
see the page at

http://hear.millennium.berkeley.edu/gmetad-webfrontend/
hear is an old sun4u (167MHz) w/a whopping 128mb of memory and running PHP 3 (so it's not working perfectly).
i think i understand some of your old rrd problems (fixed in the latest CVS). your problem is not network related (i don't think)... it's disk i/o related.

Although several times this week I have been ready to seize upon the first available explanation that "lets me off the hook," I don't buy this one. Otherwise I would have seen equivalent behavior in the Perl gmetad, and I've only added (at most) one host since then. Unless C gmetad is doing a *lot* more disk I/O ...

Of course one surefire way to test this would be to switch the RRD dir to /tmp, as you say. But the log of the new CVS gmetad is indicating dead sources again... (hint: They aren't.)

one problem is that the old version of gmetad was sending string metrics from pre-2.5.0 to disk. bad (and now fixed). that was causing some of the errors (i think) that you were seeing. your email that showed that gexec.rrd was getting updated with an "ON" value tipped me to this problem.


Yeah, I noticed that. :)

another problem is that gmetad is very disk intensive. i'm writing my rrds to /tmp on "hear" because writing to /tmp (on solaris) is writing to memory (it doesn't touch disk). you can see that my data has been solid for quite a while now.


'Cept for that little notch.  :)

i bet if you install a clean CVS version of gmetad (and delete all your old rrds on your test machine) you'll find gmetad will work for you. your other option is to start teaching yourself braille.

Aw man, delete the RRDs *AGAIN* ? I didn't read that part last time. Eh, I'll just stick 'em in /tmp.

gmetad will not write any data to the round-robin databases which is old: either a dead host or stale metric. that is why you are seeing gaps in the graphs.
actually, there are a couple things causing gappiness.
* first, the possibility of the source being marked dead. this is no longer happening.
you've taken out a good feature. why have gmetad write data to disk which is stale? it's extra disk i/o you don't need and you are missing helpful info about what machines are up and which machines have tanked.


Er, no, I mean, "this is no longer happening ERRONEOUSLY." :)

Interestingly enough this does still happen erroneously in your puny human gmetad.

* second, the parsing logic apparently causes the entire parser to bomb out if it encounters a single stale metric (maybe i should go double-check that, but off the top of my head this sounds right).
can you double-check that against the latest CVS?


This seems to have been fixed.  Jolly good show.

* third, the possibility of an error writing to the RRD file exists.
i think the latest CVS will fix this problem. when you look below, for example, you'll see that gmetad is trying to write the numerical value OFF to gexec.rrd. silliness. OFF is a string.


Yup, I did notice that but didn't get around to fixing it.
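For illustration only, here is a minimal sketch of the kind of guard being discussed: check that a metric value parses as a number before handing it to an RRD update. This is not the actual gmetad fix; the function name is_numeric_value() is hypothetical, and the real code may well decide based on the metric's declared type instead.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (not from gmetad): returns 1 if the whole string
 * parses as a floating-point number, 0 otherwise. Values such as "ON"
 * or "OFF" fail the check and can be skipped before any RRD write. */
int is_numeric_value(const char *s)
{
    char *end = NULL;

    if (s == NULL || *s == '\0')
        return 0;

    errno = 0;
    (void)strtod(s, &end);

    if (errno == ERANGE)      /* out of range for a double */
        return 0;

    /* Something must have been consumed, and nothing may be left over. */
    return end != s && *end == '\0';
}
```

Note that strtod() also accepts spellings like "nan" and "inf", so a stricter check may be wanted if those can show up as metric values.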

i will work hard the next day or so trying to get you happy on solaris.


Forget it, I'll never be happy.

:P

if we can't then i'll release 2.5.0 with a note.. but i really, really don't want to do that. i can see from my crappy solaris web server that it does work. i'd like to know when it won't. by the way, "Matt Box" in the display is a 2.5.0 gmond while unspecified is a 2.4.1 gmond. so mixing data sources doesn't seem to be the problem.

Near as I can tell, my franken-2.5.0 is fairly well-behaved now. At least, it hasn't been crashing.

for this reason you may want to make several versions of the front-end available too...
why would we do that?

Because the new one uses features that require the C version of gmetad, which may not work for everyone. :)

then let's get it out for release soon. let me know how the CVS 2.5.0 works for you on solaris. otherwise.. i think we are good to go.

After augmenting gmetad with alien technology, I still see some periodic gappiness, every 5-30 minutes.


OK, it's not really alien technology, I just did a few things to rrd_helpers.c:

* Moved the snprintf() above the line that assigns val to argv[2].
* Added in the single-failure sleep/retry code to each RRD_update().
* Out of superstition, I changed the name of the val array in one of the RRD_update() functions. (this may or may not have an effect)


See enclosed diff.
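The diff itself is not reproduced here. As a rough illustration of the single-failure sleep/retry idea only, the sketch below wraps an update call so that one failure triggers a short pause and one more attempt. Everything in it is an assumption made for the example: update_fn, update_with_retry(), and the two stub updaters are hypothetical names and do not come from rrd_helpers.c.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the real RRD update call: 0 means success. */
typedef int (*update_fn)(const char *rrd_path, const char *value);

/* Try the update once; on failure, sleep and try exactly one more time.
 * Returns 0 if either attempt succeeds, -1 if both fail. */
int update_with_retry(update_fn update, const char *rrd_path,
                      const char *value, unsigned int delay_seconds)
{
    assert(update != NULL);

    if (update(rrd_path, value) == 0)
        return 0;

    fprintf(stderr, "update of %s failed, retrying\n", rrd_path);
    sleep(delay_seconds);

    if (update(rrd_path, value) == 0) {
        fprintf(stderr, "rawk.\n");   /* recovered on the second attempt */
        return 0;
    }
    return -1;
}

/* Demo stubs, purely for trying the wrapper out. */
int calls_seen = 0;

int flaky_once(const char *rrd_path, const char *value)
{
    (void)rrd_path;
    (void)value;
    return (++calls_seen == 1) ? -1 : 0;   /* fail only on the first call */
}

int always_fail(const char *rrd_path, const char *value)
{
    (void)rrd_path;
    (void)value;
    return -1;
}
```

With the stubs above, update_with_retry(flaky_once, "gexec.rrd", "1", 0) succeeds on its second attempt, while passing always_fail returns -1 after two tries. A single retry hides a transient write error without letting a genuinely broken file stall the caller for long.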

Feel free to change or delete the debug_msg statements, they were a bit helpful in tracking down a problem and it's always amusing to see a program say "rawk." when it recovers from an error. And not just because that's what I say when I get the M24 in ArmyOps. :P


Oh!  One more thing!

Running your latest CVS version on my E420R caused an IMMEDIATE bus error. The backtrace indicates that it's the pthread stack setting that causes this error. I have wrapped a "#ifndef SOLARIS" statement around that line in my version. I'd be interested to know if it's happening to you as well.
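
A sketch of what such a guard might look like, assuming the stack setting is a pthread_attr_setstacksize() call and that SOLARIS is defined by the build on that platform. The function name init_thread_attr() and the stack size parameter are hypothetical; the actual line in gmetad may differ.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: initialise a thread attribute object and request
 * a custom stack size, except when built with -DSOLARIS, where the
 * platform default is left alone. Returns 0 or an errno-style code. */
int init_thread_attr(pthread_attr_t *attr, size_t stack_bytes)
{
    int rc;

    assert(attr != NULL);

    rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;

#ifndef SOLARIS
    rc = pthread_attr_setstacksize(attr, stack_bytes);
    if (rc != 0) {
        pthread_attr_destroy(attr);
        return rc;
    }
#else
    (void)stack_bytes;   /* keep the default stack size on Solaris */
#endif

    return 0;
}
```

The guard only sidesteps the crash; it does not explain why the stack setting triggers a bus error there, so it is best read as a workaround until the cause is known.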


Also, what version of Solaris are you running on the "confiscated" box?
