On Fri, May 15, 2009 at 08:42:33AM -0500, Adam Tygart wrote:
> On Fri, May 15, 2009 at 04:32, Carlo Marcelo Arenas Belon
> <[email protected]> wrote:
> > On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:
> >
> >> Since then I have been
> >> plagued with (what looked like) data errors, mis-reporting swap usage
> >> was the easiest to see.
> >
> > could you elaborate here?, is the value that gmond is collecting on each
> > node incorrect?, is the aggregated in gmetad incorrect?, which one of the
> > swap metrics is incorrect?
>
> Aggregate swap data being incorrect is the easiest to see.
> Here is the graph from a mis-reporting host (it doesn't always even
> send this information): http://imgur.com/io8gu.png
>
> Here is the resulting aggregate graph: http://imgur.com/trato.png
> The beginning of this graph is showing the correct data, I simply
> restarted gmond (on all non-webserver hosts), and the resulting swap
> usage was from one of them failing to send the correct data.

OK, so the metric value is not incorrect; it is not being reported at
all, which is why you have dips in your graph that fix themselves after
several minutes.

This is sadly a known issue, caused by the way gmond registers metrics
dynamically and the fact that some of those metrics aren't refreshed
very frequently, as described in the Release Notes (which mention the
very visible CPU count issue as an example). For more details, see the
discussion at:

  http://www.mail-archive.com/[email protected]/msg04275.html

Even though I agree it is a bug, it doesn't have a solution yet. It is
not seen unless a gmond (any of them) is restarted, and a workaround is
available: if you have to restart a gmond, first restart its "collector"
(the one that gmetad is looking at, and that the rest are pointing to
when using unicast), and then restart ALL the other gmond in the cluster
after that.

> >> The question I have is this: is this a known bug?
> >
> > some are, like the unicast send_metadata_interval or the cpu_count
> > inconsistency as shown by the "Important Notes", some others might not be
>
> I haven't been able to find the "Important Notes" document, is there a
> link to this somewhere?

sadly it is buried at the bottom of the "Release Notes" now:

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

and yes, I agree it should be moved to a better place as well.

> Is the cpu_count inconsistency the piece I mentioned about hosts
> disappearing from the web interface?

most likely the host disappearing from the web interface is because of
the send_metadata_interval issue and your attempts to restart gmond to
fix it. if it is not, then we have a new bug ;)

> >> Is there something else I should try?
> >
> > rollback to 3.0, specially if you don't need the modules but want a more
> > stable setup.
>
> This being Gentoo, I have no "easy" way of rolling back, as the 3.0.x
> builds have been removed from their tree.

OK, IMHO having ganglia 3.0 in their tree as well, in a different slot,
might be a good idea, but sadly I haven't filed that as a bug yet, nor
can I provide a working ebuild in a public overlay as a solution. Of
course, you can still build your own binaries/packages if needed.

3.0 is still under development, with 3.0.8 going to be released sometime
soon and future releases focusing mainly on stability and compatibility
with 3.1, as well as on supporting all the other architectures that are
not yet working in 3.1.

> The whole reason I upgraded was because I wanted to make use of the
> python module support. I was previously using gmetric for monitoring
> things like PBS job count and temperature on my nodes. After a week or
> two of those scripts running, the load average on the systems started
> to climb. After a month, the load average increase caused by gmetrics
> was are 2-4 per host. A full 10% of my cluster's CPU utilization was
> caused by gmetrics alone (all "system" cpu).

most likely that comes from the spawn/fork cost and from running them
too frequently; 3.1 modules might be a good solution for that. But if
the metric collection itself is expensive (and I would assume it is, as
I have never seen that much consumption from my own gmetric calls, which
are executed every second), then you are not going to solve the problem
by just moving that expensive operation into gmond.

Carlo

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

