Not sure if this has anything to do with anything, but I added my
single-node Alpha cluster as a data source and that's currently my only
data source that has no gaps in data collection. My 2.5.0 fileserver
cluster has several of the nodes defined as collection targets, whereas the
2.4.1 cluster has a single "silent partner" node that trusts only the
gmetad box.
The graphs, in decreasing order of stability, are: Alpha, 2.5.0
(multi-source), and 2.4.1 (single-source).
Since nobody has complained of this sort of thing on Linux, I can only
assume it doesn't happen with gmetad on Linux. Funky Solaris socket
library strikes again? That's the only thing I can think of, because I
know it isn't a network issue. The largest "payload" comes from the 2.4.1
machine, which transmits its 500k XML feed in under a second. So it's
not Matt's network latency threshold. Perhaps the "nap" code is pushing it
over the top in this case?
Has anyone tried this on a 200+ node setup on Linux and experienced the
same "gappy" behavior?
I came back from lunch to find that gmetad had segfaulted, but hadn't
dumped core (while parsing a metric, it expected one key and got zero).
I'll see if it does it again (all sources appeared to be up...).
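To make the failure mode concrete: this is a sketch (in Python, not gmetad's C parser) of the kind of guard I suspect is missing, assuming the feed's usual METRIC elements with NAME/VAL attributes. A METRIC carrying zero keys where one is expected should be skipped, not dereferenced:

```python
import xml.etree.ElementTree as ET

def parse_metrics(xml_text):
    """Pull (name, value) pairs out of a Ganglia XML feed,
    skipping malformed METRIC elements instead of crashing on them."""
    metrics = []
    for elem in ET.fromstring(xml_text).iter("METRIC"):
        name = elem.get("NAME")
        val = elem.get("VAL")
        if name is None or val is None:
            continue  # zero keys where one was expected: skip, don't crash
        metrics.append((name, val))
    return metrics
```

In the C parser the equivalent would be a NULL check on the attribute lookup before using it; that's speculation on my part until I can get a core file.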
I also want to see if 2.4.1 will crash again. That, more than anything
else, makes me think there's something rotten in the state of the Solaris
socket lib...
[another rambling, conclusionless email...]