i just checked ./gmond/metric.h and the cpu_aidle metric was gone.  i have
to say that i'm very annoyed that that metric was removed without
notification!  i'm sure it was an honest mistake but it's a costly one.  
taking out that metric moved all the metric XDR values and has caused a
lot of grief.  by removing cpu_aidle (which by the way is the percent of
time since boot that the machine has been idle [absolute idle]), 2.4.x and
2.5.x gmonds will not work together at all!  it's okay to add metrics and
extend the range of XDR values (gmond will ignore values it doesn't "know"  
about) but we cannot have developers chopping out metrics which have
already been established.  it shifts the XDR values for every other metric
beneath it!
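
to make the shift concrete, here is a rough sketch in C.  the two tables below are NOT the real contents of metric.h; they are a shortened, made-up ordering (only cpu_aidle, gexec and heartbeat are names from this thread, and the key numbers are smaller than the real ones).  xdr_key() and COUNT() are hypothetical helpers for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* NOT the real gmond tables: a shortened, invented ordering. */
static const char *table_24x[] = { "cpu_idle", "cpu_aidle", "gexec", "heartbeat" };

/* same list with cpu_aidle chopped out, as happened in 2.5.0 */
static const char *table_250[] = { "cpu_idle", "gexec", "heartbeat" };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* in this sketch the XDR key of a metric is simply its position in the table */
static int xdr_key(const char **table, size_t n, const char *name)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(table[i], name) == 0)
            return (int)i;
    }
    return -1;  /* metric is not in this table */
}
```

in this toy layout heartbeat is key 3 in the old table and key 2 in the new one, and key 2 is what the old table calls gexec.  that is the same kind of collision as the real 25/26 mix-up.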

without cpu_aidle, heartbeat messages on 2.5.0 were given XDR key 25 and 
were being processed like gexec status messages (which are strings instead 
of uint32s).  it's no wonder the multicast threads were not happy when 
2.4.x heartbeats were being processed.. and it's no wonder the heartbeat 
messages were creating crazy REPORTED, TN, TMAX values!

for now (until ganglia3) XDR key = 26 is heartbeat.  regardless of
platform or version.. 26 is life.  in ganglia3 we'll address the XDR
nightmare but ..for now.. if we have agreement that 26 is a heartbeat
message we'll be okay.

i want to clarify how gmond XDR keys work to avoid this problem in the 
future.

in ./gmond/metric.h is a metric_t array which lists IN ORDER every metric 
that gmond processes.  if the order is ever broken.. compatibility 
between versions of gmond is broken.  if gmond gets an XDR key that is 
outside of the range it knows about... it will ignore the message.    
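
the range check described above could look roughly like this.  this is only a sketch: NUM_KNOWN_METRICS and key_is_known() are hypothetical names, and the count of 27 is an assumption chosen so that key 26 (heartbeat) is the last known key.

```c
#include <stdint.h>

/* hypothetical: this gmond knows keys 0..26 */
#define NUM_KNOWN_METRICS 27u

/* returns 1 if the message should be processed, 0 if it should be ignored */
static int key_is_known(uint32_t key)
{
    return key < NUM_KNOWN_METRICS;
}
```

note that a check like this only protects against keys that are too large.  a key that is in range but shifted (like heartbeat landing on 25) passes the check and gets processed as the wrong metric.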

in ./gmond/key_metrics.h is an enum which HAS TO EXACTLY MATCH the 
metric_t array in ./gmond/metric.h.  whoever took out cpu_aidle .. took it 
out of metric.h but left it in ./gmond/key_metrics.h making the problem 
even worse.
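
one possible way to catch this kind of mismatch early is sketched below.  none of this is the actual monitor-core code: the enum, the struct, the sentinel and tables_in_sync() are all hypothetical, the lists are shortened, and _Static_assert assumes a C11 compiler.

```c
#include <stddef.h>

/* hypothetical, shortened stand-in for the enum in key_metrics.h */
enum key_metrics_sketch {
    key_cpu_idle,
    key_cpu_aidle,
    key_gexec,
    key_heartbeat,
    num_key_metrics   /* sentinel: must always stay last */
};

/* hypothetical, shortened stand-in for the metric_t array in metric.h */
typedef struct {
    int key;
    const char *name;
} metric_sketch_t;

static const metric_sketch_t metric_sketch[] = {
    { key_cpu_idle,  "cpu_idle"  },
    { key_cpu_aidle, "cpu_aidle" },
    { key_gexec,     "gexec"     },
    { key_heartbeat, "heartbeat" }
};

/* the build fails if an entry is removed from one list but not the other */
_Static_assert(sizeof(metric_sketch) / sizeof(metric_sketch[0]) == num_key_metrics,
               "metric array and key enum are out of sync");

/* runtime check: entry i of the array must carry enum value i */
static int tables_in_sync(void)
{
    size_t i;

    for (i = 0; i < (size_t)num_key_metrics; i++) {
        if (metric_sketch[i].key != (int)i)
            return 0;
    }
    return 1;
}
```

the compile-time check only compares the lengths of the two lists.  the runtime check catches reordering too, but neither one can tell you that a key changed meaning between releases.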

please be specific about what changes you are making to the monitor-core
when you check things in and if you have any questions at all about the
monitor-core source... please e-mail me or ask.

i've checked in the changes to 2.5.0 to make it happy with 2.4.x so you 
need to cvs update.

steve, i think this error is a big contributor to the problems that you
are seeing on solaris.  if the REPORTED, TN or TMAX values are not correct
then gmetad will not work at all.  heartbeat messages are critical for
making it work.

gmetad will not write old data to the round-robin databases, whether it 
comes from a dead host or a stale metric.  that is why you are seeing gaps 
in the graphs.

please try the new 2.5.0 source and let me know if solaris is a little 
happier now.

preston, is the freebsd world happy with 2.5.0?  i've compiled the source 
on the sourceforge compile farm but i'm unable to run it.  does it work 
for you?

i really want 2.5.0 out the door soon.  2.5.0 is very solid on linux right 
now and i think it's near time to release it.

-matt