i just checked ./gmond/metric.h and the cpu_aidle metric was gone. i have to say that i'm very annoyed that that metric was removed without notification! i'm sure it was an honest mistake but it's a costly one. taking out that metric moved all the metric XDR values and has caused a lot of grief. by removing cpu_aidle (which by the way is the percent of time since boot that the machine has been idle [absolute idle]), 2.4.x and 2.5.x gmonds will not work together at all! it's okay to add metrics and extend the range of XDR values (gmond will ignore values it doesn't "know" about) but we cannot have developers chopping out metrics which have already been established. it shifts the XDR values for every other metric beneath it!
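
to make the shift concrete, here is a minimal sketch in C. the tables are hypothetical and shortened: they are not the real contents of metric.h and the indices are not the real XDR keys. key_for and name_for are made-up helper names. the sketch assumes only what is described above, that a metric's XDR key is its position in the table and that a receiver ignores keys outside the range it knows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* HYPOTHETICAL, shortened metric tables. these are NOT the real metric.h
 * entries; they only illustrate that the XDR key is the table index. */
static const char *const table_old[] = { "cpu_idle", "cpu_aidle", "gexec", "heartbeat" };
/* same table with cpu_aidle chopped out of the middle */
static const char *const table_cut[] = { "cpu_idle", "gexec", "heartbeat" };
/* same table with a new metric appended at the end */
static const char *const table_ext[] = { "cpu_idle", "cpu_aidle", "gexec", "heartbeat", "new_metric" };

#define TABLE_LEN(t) (sizeof(t) / sizeof((t)[0]))

/* sender side: the key put on the wire is the index of the metric name.
 * returns -1 if the name is not in the table. */
static int key_for(const char *const *table, size_t len, const char *name)
{
    assert(table != NULL && name != NULL);
    for (size_t i = 0; i < len; i++) {
        if (strcmp(table[i], name) == 0)
            return (int)i;
    }
    return -1;
}

/* receiver side: map a key back to a metric name using the receiver's own
 * table. NULL means the key is out of range and the message is ignored. */
static const char *name_for(const char *const *table, size_t len, int key)
{
    if (key < 0 || (size_t)key >= len)
        return NULL;
    return table[key];
}
```

with these toy tables, a heartbeat sent by a build using table_cut goes out with key 2, which a build using table_old reads as gexec. a metric appended at the end (table_ext) gets key 4, which the old build does not know about and ignores, while every existing key stays where it was.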
without cpu_aidle, heartbeat messages on 2.5.0 were given XDR key 25 and were being processed like gexec status messages (which are strings instead of uint32s). it's no wonder the multicast threads were not happy when 2.4.x heartbeats were being processed, and it's no wonder the heartbeat messages were creating crazy REPORTED, TN and TMAX values!

for now (until ganglia3) XDR key = 26 is heartbeat, regardless of platform or version. 26 is life. in ganglia3 we'll address the XDR nightmare, but for now, if we have agreement that 26 is a heartbeat message, we'll be okay.

i want to clarify how gmond XDR keys work to avoid this problem in the future. in ./gmond/metric.h is a metric_t array which lists IN ORDER every metric that gmond processes. if the order is ever broken, compatibility between versions of gmond is broken. if gmond gets an XDR key that is outside of the range it knows about, it will ignore the message. in ./gmond/key_metrics.h is an enum which HAS TO EXACTLY MATCH the metric_t array in ./gmond/metric.h. whoever took out cpu_aidle took it out of metric.h but left it in ./gmond/key_metrics.h, making the problem even worse.

please be specific about what changes you are making to the monitor-core when you check things in, and if you have any questions at all about the monitor-core source, please e-mail me or ask. i've checked in the changes to 2.5.0 to make it happy with 2.4.x, so you need to cvs update.

steve, i think this error is a big contributor to the problems that you are seeing on solaris. if the REPORTED, TN or TMAX values are not correct then gmetad will not work at all. heartbeat messages are critical for making it work. gmetad will not write any data that is old to the round-robin databases, whether it comes from a dead host or is a stale metric. that is why you are seeing gaps in the graphs. please try the new 2.5.0 source and let me know if solaris is a little happier now.

preston, is the freebsd world happy with 2.5.0?
i've compiled the source on the sourceforge compile farm but i'm unable to run it. does it work for you?

i really want 2.5.0 out the door soon. 2.5.0 is very solid on linux right now and i think it's near time to release it.

-matt
