matt massie wrote:
Today, Steven Wagner wrote forth saying...
although i would like to point out that i have never personally
checked in changes to any of the ganglia projects because of the
firewall configuration here at the company.
so if it *was* "my" fault, it's matt's fault because i send him diffs. :)
heheh. that's true.. i should have caught it (if it was you). i don't
know who it was.. it's not really important.. i feel i might have
overreacted a bit.
Maybe so, Mr. Smartypants! (or, to our English friends, "Mr. Clever
Trousers!")
why the crazy setup? i think your homebrew might be dangerous to your
vision. remember: ethyl alcohol good. methyl alcohol bad.
Well, it's not a *stock* version of 2.4.1 on these nodes, so...
[I knew that the code I'd be using would fork at some point but I'm trying
to keep it pretty much in line with releases ... hence anxiety over 2.5.0
release :) ]
i've confiscated a solaris box here to test gmetad on solaris. you can
see the page at
http://hear.millennium.berkeley.edu/gmetad-webfrontend/
hear is an old sun4u (167Mhz) w/a whopping 128mb of memory and running
PHP 3 (so it's not working perfectly).
i think i understand some of your old rrd problems (fixed in the latest
CVS).
your problem is not network related (i don't think)... it's disk i/o
related.
Although several times this week I have been ready to seize upon the first
available explanation that "lets me off the hook," I don't buy this one.
Otherwise I would have seen equivalent behavior in the Perl gmetad, and
I've only added (at most) one host since then. Unless C gmetad is doing a
*lot* more disk I/O ...
Of course one surefire way to test this would be to switch the RRD dir to
/tmp, as you say. But the log of the new CVS gmetad is indicating dead
sources again... (hint: They aren't.)
one problem is that the old version of gmetad was sending string metrics
from pre-2.5.0 to disk. bad (and now fixed). that was causing some of
the errors (i think) that you were seeing. your email that showed that
gexec.rrd was getting updated with a "ON" value tipped me to this problem.
Yeah, I noticed that. :)
another problem is that gmetad is very disk intensive. i'm writing my
rrds to /tmp on "hear" because writing to /tmp (on solaris) is writing to
memory (it doesn't touch disk). you can see that my data has been solid
for quite a while now.
'Cept for that little notch. :)
i bet if you install a clean CVS version of gmetad (and delete all your
old rrds on your test machine) you'll find gmetad will work for you. your
other option is to start teaching yourself braille.
Aw man, delete the RRDs *AGAIN* ? I didn't read that part last time. Eh,
I'll just stick 'em in /tmp.
gmetad will not write any data to the round-robin databases that is old:
either a dead host or stale metric. that is why you are seeing gaps in
the graphs.
actually, there are a couple things causing gappiness.
* first, the possibility of the source being marked dead. this is no
longer happening.
you've taken out a good feature. why have gmetad write data to disk which
is stale? it's extra disk i/o you don't need and you are missing helpful
info about what machines are up and which machines have tanked.
Er, no, I mean, "this is no longer happening ERRONEOUSLY." :)
Interestingly enough this does still happen erroneously in your puny human
gmetad.
* second, the parsing logic apparently causes the entire parser to bomb
out if it encounters a single stale metric (maybe i should go double-check
that, but off the top of my head this sounds right).
can you double-check that against the latest CVS?
This seems to have been fixed. Jolly good show.
* third, the possibility of an error writing to the RRD file exists.
i think the latest CVS will fix this problem. when you look below, for
example, you'll see that gmetad is trying to write the numerical value
OFF to gexec.rrd. silliness. OFF is a string.
Yup, I did notice that but didn't get around to fixing it.
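For reference, the "don't hand strings like ON/OFF to the RRD" check discussed above could look roughly like the sketch below. This is not the code in CVS; is_numeric_value() is a hypothetical helper name, and it simply asks strtod() whether the whole string parses as a number.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper (not from the ganglia tree): returns 1 if the whole
 * string parses as a number, 0 otherwise. */
static int is_numeric_value(const char *value)
{
    char *end;

    if (value == NULL || *value == '\0')
        return 0;

    /* Only the end pointer matters here, not the parsed value. */
    (void)strtod(value, &end);
    if (end == value)
        return 0;            /* nothing was consumed: "ON", "OFF", ... */

    /* Allow trailing whitespace, reject anything else ("12abc"). */
    while (isspace((unsigned char)*end))
        end++;

    return *end == '\0';
}
```

A caller would skip the RRD update whenever this returns 0. Note that strtod() also accepts things like "inf", "nan" and hex constants, so a stricter check may be wanted in practice.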
i will work hard the next day or so trying to get you happy on solaris.
Forget it, I'll never be happy.
:P
if we can't then i'll release 2.5.0 with a note.. but i really, really
don't want to do that. i can see from my crappy solaris web server that
it does work. i'd like to know when it won't. by the way, "Matt Box" in
the display is a 2.5.0 gmond while unspecified is a 2.4.1 gmond. so
mixing data sources doesn't seem to be the problem.
Near as I can tell, my franken-2.5.0 is fairly well-behaved now. At least,
it hasn't been crashing.
for this reason you may want to make several versions of the front-end
available too...
why would we do that?
Because the new one uses features that require the C version of gmetad,
which may not work for everyone. :)
then let's get it out for release soon. let me know how the CVS 2.5.0
works for you on solaris. otherwise.. i think we are good to go.
After augmenting gmetad with alien technology, I still see some periodic
gappiness, every 5-30 minutes.
OK, it's not really alien technology, I just did a few things to rrd_helpers.c:
* Moved the snprintf() above the line that assigns val to argv[2].
* Added in the single-failure sleep/retry code to each RRD_update().
* Out of superstition, I changed the name of the val array in one of the
RRD_update() functions. (this may or may not have an effect)
See enclosed diff.
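For anyone reading without the diff, the shape of the first two changes is roughly the sketch below. It is not the actual rrd_helpers.c code: update_fn is a hypothetical stand-in for the real RRD update call (so this compiles without librrd), update_with_retry() and its parameters are made-up names, and flaky_update() only exists to exercise the retry path.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the real RRD update call: 0 means success. */
typedef int (*update_fn)(int argc, const char **argv);

/* Format "timestamp:value" into val first, then build argv, then update.
 * On a single failure, sleep and retry exactly once. */
static int update_with_retry(update_fn update, const char *rrd_path,
                             const char *value, unsigned long timestamp,
                             unsigned int retry_delay_sec)
{
    char val[128];
    const char *argv[3];
    int n;

    n = snprintf(val, sizeof val, "%lu:%s", timestamp, value);
    if (n < 0 || (size_t)n >= sizeof val)
        return -1;                      /* formatting failed or truncated */

    argv[0] = "update";
    argv[1] = rrd_path;
    argv[2] = val;                      /* assigned after val is filled in */

    if (update(3, argv) == 0)
        return 0;

    fprintf(stderr, "update of %s failed, retrying once\n", rrd_path);
    sleep(retry_delay_sec);

    if (update(3, argv) == 0) {
        fprintf(stderr, "rawk.\n");     /* recovered on the retry */
        return 0;
    }
    return -1;
}

/* Demo-only updater: fails on the first call, succeeds afterwards. */
static int flaky_calls = 0;
static int flaky_update(int argc, const char **argv)
{
    (void)argc;
    (void)argv;
    return (++flaky_calls == 1) ? -1 : 0;
}
```

Calling update_with_retry(flaky_update, "load_one.rrd", "1.5", 1000UL, 0) goes through the failure path once and then succeeds on the second attempt.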
Feel free to change or delete the debug_msg statements; they were a bit
helpful in tracking down a problem and it's always amusing to see a program
say "rawk." when it recovers from an error. And not just because that's
what I say when I get the M24 in ArmyOps. :P
Oh! One more thing!
Running your latest CVS version on my E420R caused an IMMEDIATE bus error.
The backtrace indicates that it's the pthread stack setting that causes
this error. I have wrapped a "#ifndef SOLARIS" statement around that line
in my version. I'd be interested to know if it's happening to you as well.
Also, what version of Solaris are you running on the "confiscated" box?
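In case it helps, the #ifndef SOLARIS guard around the pthread stack setting looks something like the sketch below. I am assuming the call in question is pthread_attr_setstacksize(); init_thread_attr() and the stack_size parameter are hypothetical names for illustration, not what is in CVS.

```c
#include <pthread.h>
#include <stddef.h>

/* Initialise thread attributes, setting an explicit stack size everywhere
 * except when SOLARIS is defined. Returns 0 or a pthread error code. */
static int init_thread_attr(pthread_attr_t *attr, size_t stack_size)
{
    int rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;

#ifndef SOLARIS
    /* Assumed to be the call that triggers the bus error on Solaris. */
    rc = pthread_attr_setstacksize(attr, stack_size);
    if (rc != 0) {
        pthread_attr_destroy(attr);
        return rc;
    }
#else
    (void)stack_size;   /* keep the platform's default stack size */
#endif

    return 0;
}
```

With the guard in place, Solaris builds simply keep the default stack size, which hides the crash but does not explain it.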