Not sure if this has anything to do with anything, but I added my single-node Alpha cluster as a data source and that's currently my only data source that has no gaps in data collection. My 2.5.0 fileserver cluster has several of the nodes defined as collection targets, whereas the 2.4.1 cluster has a single "silent partner" node that trusts only the gmetad box.

The graphs, in decreasing order of stability, are: Alpha, 2.5.0 (multi-source) and 2.4.1 (single-source).

Since nobody has complained of this sort of thing on Linux, I can only assume that it does not happen on Linux gmetad. Funky Solaris socket library strikes again? That's the only thing I can think of, because I know it isn't a network issue. The largest "payload" comes from the 2.4.1 machine, which transmits a 500k XML feed in under a second, so it's not Matt's network latency threshold. Perhaps the "nap" code is pushing it over the top in this case?

Has anyone tried this on a 200+ node setup on Linux and experienced the same "gappy" behavior?

I came back from lunch to find that gmetad had segfaulted, but hadn't dumped core (in parsing a metric, it expected one key and got zero). I'll see if it does it again (all sources appeared to be up...).

Also want to see if 2.4.1 will crash again. That more than anything else makes me think there's something rotten in the state of the Solaris socket lib...

[another rambling, conclusionless email...]
