Today, Steven Wagner wrote forth saying...
> People have cited disk I/O as a bottleneck. I personally doubt this.
> If it were true, you'd be seeing random gaps whenever RRD updates came
> thick and fast (i.e. while all threads were updating RRDs at once),
> and the failures should be at least a bit more distributed - not just
> in one cluster.
that is not completely true. what is the size of each cluster? if the
cluster you are having problems with is very large then it will have more
disk i/o and be more likely to fail. another thing to think about is the
pipe between the clusters. bear in mind.. this is what causes a cluster
to show up with a gap: gmetad is unable to pull all the XML from the
source AND save the data to RRDs in 30 seconds. if the cluster gmond is
slow or the pipe is slow and gmetad takes 15 seconds to even get the data,
much less save it to disk, then you'll have a gap.
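to put rough numbers on that, here is a minimal sketch in C of the timing
budget. it is not gmetad code: POLL_WINDOW_SECS, poll_leaves_gap() and
save_budget_secs() are made-up names for illustration, and the only figure
taken from above is the 30 second window.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical name for the 30 second window described above */
#define POLL_WINDOW_SECS 30.0

/* true when pulling the XML plus saving the RRDs overruns the window,
 * which is the condition described above for a cluster showing a gap */
bool poll_leaves_gap(double fetch_secs, double save_secs)
{
    assert(fetch_secs >= 0.0 && save_secs >= 0.0);
    return (fetch_secs + save_secs) > POLL_WINDOW_SECS;
}

/* seconds left for the RRD writes once the XML has arrived (never negative) */
double save_budget_secs(double fetch_secs)
{
    assert(fetch_secs >= 0.0);
    double left = POLL_WINDOW_SECS - fetch_secs;
    return left > 0.0 ? left : 0.0;
}
```

so a 15 second fetch leaves save_budget_secs(15.0) = 15 seconds for the
disk writes, and poll_leaves_gap(15.0, 16.0) comes out true.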
in future releases .. you will be able to arbitrarily set the polling
interval and we will be compressing the XML over the pipe.. both of which
should help.
make sense? (i don't explain things well sometimes)
> I really hope you aren't mixing Linux, FreeBSD and IRIX nodes *WITHIN*
> the same cluster. They support different metric configurations which,
> unfortunately, don't play well with one another, and depending on the
> platform gmetad queries, you may get wildly different results (not to
> mention funky heartbeats). Individual clusters should be as
> homogenous as possible, or you should check out metric.h and make sure
> all the members are using only the metrics they all understand, and in
> that order.
the heartbeats are fine now between versions. you won't have any problems
there. there are, however, metric collisions between solaris and linux (but
not between other operating systems). this will be addressed in coming
releases. for now..
a solaris box thinks it's ...    a linux box thinks it's ...
cpu_wio                          bytes_out
bread_sec                        bytes_in
bwrite_sec                       pkts_in
lread_sec                        pkts_out
lwrite_sec                       disk_total
rcache                           disk_free
wcache                           part_max_used
phread_sec
phwrite_sec
sooo... a solaris box reporting "cpu_wio" would be saved as "bytes_out"
on a linux box, and a linux box reporting its "disk_free" would be saved
on a solaris box as "rcache".
a solaris box reporting "phwrite_sec" would find that all linux boxes are
ignoring the data.
again.. this problem will not exist in the very near future.
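the collision table above can be sketched as two name arrays that share
one index. this is only an illustration in C: the array index stands in for
the shared metric key, the real key values in metric.h are not shown in
this mail, and name_on_receiver() is a made-up helper, not a ganglia
function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* colliding names from the table above, in the same row order.
 * ASSUMPTION: the array index plays the role of the shared metric key. */
static const char *const solaris_names[] = {
    "cpu_wio", "bread_sec", "bwrite_sec", "lread_sec", "lwrite_sec",
    "rcache", "wcache", "phread_sec", "phwrite_sec"
};
static const char *const linux_names[] = {
    "bytes_out", "bytes_in", "pkts_in", "pkts_out",
    "disk_total", "disk_free", "part_max_used"
};

#define N_SOLARIS (sizeof solaris_names / sizeof solaris_names[0])
#define N_LINUX   (sizeof linux_names / sizeof linux_names[0])

/* what the receiving platform calls a metric the sender knows as sent_name.
 * returns NULL when the sender has no such name, or when the receiver has
 * no entry in that slot (the data is ignored). */
const char *name_on_receiver(const char *const *sender, size_t n_sender,
                             const char *const *receiver, size_t n_receiver,
                             const char *sent_name)
{
    assert(sent_name != NULL);
    for (size_t i = 0; i < n_sender; i++) {
        if (strcmp(sender[i], sent_name) == 0)
            return i < n_receiver ? receiver[i] : NULL;
    }
    return NULL;
}
```

looking up "cpu_wio" from the solaris side gives "bytes_out" on linux,
"disk_free" from the linux side gives "rcache" on solaris, and
"phwrite_sec" from solaris gives NULL since linux has nothing in that slot.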
> Usually the "smoking gun" error message is in the lines above the
> data_thread() error. But I don't remember whether my spuriously
> verbose debug statements made it into 2.5.0 release, so you may want
> to add some of your own (I could grep for "error" and get most of the
> useful output).
good idea. you can just add lines...
debug_msg("it got here and did this");
anywhere that you want to try and track down the problem.
> So, I have no idea if any of the above helps but maybe it'll give you some
> ideas... and anyway I felt obligated to write since that IRIX port is
> mostly my fault. :P
actually.. a lot of good things in ganglia are steve's fault.
-matt