Today, Steven Wagner wrote forth saying...

> People have cited disk I/O as a bottleneck.  I personally doubt this.  
> If it were true, you'd be seeing random gaps whenever RRD updates came
> thick and fast (i.e. while all threads were updating RRDs at once),
> and the failures should be at least a bit more distributed - not just
> in one cluster.

that is not completely true.  what is the size of each cluster?  if the 
cluster you are having problems with is very large then it will have more 
disk i/o and be more likely to fail.  another thing to think about is the 
pipe between the clusters.  bear in mind.. this is what causes a cluster 
to show up with a gap: gmetad is unable to pull all the XML from the 
source AND save the data to RRDs in 30 seconds.  if the cluster gmond is 
slow or the pipe is slow and gmetad takes 15 seconds to even get the data, 
much less save it to disk, then you'll have a gap.
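
a rough sketch of that timing rule, in case it helps to see it as code.  
the function names and the idea of measuring fetch and save times 
separately are assumptions for illustration, not anything gmetad exposes; 
only the 30 second window comes from the explanation above.

```c
#include <assert.h>

/* Polling window in seconds (the 30 seconds mentioned above). */
#define POLL_WINDOW_SECS 30.0

/* Returns 1 if one poll cycle would miss the window and leave a gap,
 * 0 otherwise.
 *   fetch_secs: time spent pulling the XML from the source
 *   save_secs:  time spent writing the data to RRDs
 * Both are hypothetical measurements you would have to take yourself. */
int poll_leaves_gap(double fetch_secs, double save_secs)
{
    assert(fetch_secs >= 0.0 && save_secs >= 0.0);
    return (fetch_secs + save_secs) > POLL_WINDOW_SECS;
}

/* Seconds left for the RRD writes once the XML has arrived.
 * Negative if the fetch alone already used up the whole window. */
double save_budget_secs(double fetch_secs)
{
    assert(fetch_secs >= 0.0);
    return POLL_WINDOW_SECS - fetch_secs;
}
```

so a 15 second fetch leaves save_budget_secs(15.0) = 15 seconds for the 
disk writes, and a slow pipe can cause a gap even when the disk is fine.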

in future releases .. you will be able to arbitrarily set the polling 
interval and we will be compressing the XML over the pipe.. both of which 
should help.

make sense?  (i don't explain things well sometimes)

> I really hope you aren't mixing Linux, FreeBSD and IRIX nodes *WITHIN*
> the same cluster.  They support different metric configurations which,
> unfortunately, don't play well with one another, and depending on the
> platform gmetad queries, you may get wildly different results (not to
> mention funky heartbeats).  Individual clusters should be as
> homogenous as possible, or you should check out metric.h and make sure
> all the members are using only the metrics they all understand, and in
> that order.

the heartbeats are fine now between versions.  you won't have any problems
there.  there are metric collisions between solaris and linux, however, but
not between other operating systems.  this will be addressed in coming
releases.  for now..

a solaris box thinks it's ...     a linux box thinks it's ...
           
   cpu_wio                        bytes_out
   bread_sec                      bytes_in
   bwrite_sec                     pkts_in
   lread_sec                      pkts_out
   lwrite_sec                     disk_total
   rcache                         disk_free
   wcache                         part_max_used
   phread_sec
   phwrite_sec

sooo...  a solaris box reporting "cpu_wio" would be saved as "bytes_out" 
on a linux box or a linux box reporting its "disk_free" would be saved to 
a solaris box as "rcache".

a solaris box reporting "phwrite_sec" would find that all linux boxes are 
ignoring the data.
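
a minimal sketch of the collision, assuming it comes from both platforms 
using the same slot number for different metric names.  the slot order is 
copied from the table above; the array and function names are made up for 
illustration and the real metric.h is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Slot order taken from the table above (illustrative names only). */
static const char *solaris_slots[] = {
    "cpu_wio", "bread_sec", "bwrite_sec", "lread_sec", "lwrite_sec",
    "rcache", "wcache", "phread_sec", "phwrite_sec"
};
static const char *linux_slots[] = {
    "bytes_out", "bytes_in", "pkts_in", "pkts_out", "disk_total",
    "disk_free", "part_max_used"
};

#define N_SOLARIS (sizeof solaris_slots / sizeof solaris_slots[0])
#define N_LINUX   (sizeof linux_slots / sizeof linux_slots[0])

/* Find the slot a name occupies in a table, or -1 if it is absent. */
static int slot_of(const char *name, const char **table, size_t n)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(table[i], name) == 0)
            return (int)i;
    return -1;
}

/* Name a linux box would store for a metric sent by a solaris box.
 * NULL means linux has no such slot and ignores the data. */
const char *solaris_seen_by_linux(const char *name)
{
    int slot = slot_of(name, solaris_slots, N_SOLARIS);
    if (slot < 0 || (size_t)slot >= N_LINUX)
        return NULL;
    return linux_slots[slot];
}

/* Name a solaris box would store for a metric sent by a linux box. */
const char *linux_seen_by_solaris(const char *name)
{
    int slot = slot_of(name, linux_slots, N_LINUX);
    if (slot < 0 || (size_t)slot >= N_SOLARIS)
        return NULL;
    return solaris_slots[slot];
}
```

with these tables solaris_seen_by_linux("cpu_wio") gives "bytes_out", 
linux_seen_by_solaris("disk_free") gives "rcache", and 
solaris_seen_by_linux("phwrite_sec") gives NULL (ignored).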

again.. this problem will not exist in the very near future.  

> Usually the "smoking gun" error message is in the lines above the
> data_thread() error.  But I don't remember whether my spuriously
> verbose debug statements made it into 2.5.0 release, so you may want
> to add some of your own (I could grep for "error" and get most of the
> useful output).

good idea.  you can just add lines...

debug_msg("it got here and did this");

anywhere that you want to try and track down the problem.
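
it usually helps to put the location and a measured value into the 
message, so that grepping the output tells you where the time went.  the 
helper below is a hypothetical stand-in that only builds such a message 
into a buffer so it compiles outside the ganglia tree; it is not ganglia's 
debug_msg(), and trace_format is a made-up name.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: writes "[where] <formatted message>" into buf.
 * Returns the number of characters the full message needs (as snprintf
 * does), or a negative value on a formatting error. */
int trace_format(char *buf, size_t size, const char *where,
                 const char *fmt, ...)
{
    va_list ap;
    int used, n;

    assert(buf != NULL && size > 0);

    /* Prefix with the location so the line can be traced back. */
    used = snprintf(buf, size, "[%s] ", where);
    if (used < 0 || (size_t)used >= size)
        return used;

    /* Append the caller's printf-style message after the prefix. */
    va_start(ap, fmt);
    n = vsnprintf(buf + used, size - (size_t)used, fmt, ap);
    va_end(ap);

    return n < 0 ? n : used + n;
}
```

for example, trace_format(buf, sizeof buf, "data_thread", "fetch took %d 
secs", 15) fills buf with "[data_thread] fetch took 15 secs", which you 
could then hand to whatever debug output call you are using.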

> So, I have no idea if any of the above helps but maybe it'll give you some 
> ideas... and anyway I felt obligated to write since that IRIX port is 
> mostly my fault. :P

actually.. a lot of good things in ganglia are steve's fault.

-matt

