Good morning,

Afraid this is going to require a little back story. We were interested in using Ganglia to monitor a few priority systems. Because we were running Debian at the time, we just used the Ganglia 2.5.7-2 package from the Debian Sarge (stable) apt repository. The project grew to include many other systems. When we had a hardware failure, we quickly moved the Ganglia monitor to a Mandrake 10 box about a week ago.

The project is much larger now and I had all sorts of problems. I went through the email list and fixed almost all of them, but one recurring theme was that I really should update to a newer version, as there are many updates, changes, and fixes. So I did. Probably not the best choice, but what's done is done. While it is important to have this running soon, I do have time to rebuild if absolutely necessary.

The Current Setup.

The system running the Ganglia monitor is a Mandrake 10.1 box:
Gmetad Web Frontend v2.5.7
Gmetad Web Backend v3.0.3

The nodes are all running Gmond 3.0.0 (a few of the computers are Windows and there are many flavors of Linux, so I updated to this version across the board in hopes of keeping things on their end as close as possible).

The Problems.

1) At any given time at least half of the computers are reported as dead, even though they are not. Doing a `telnet computer 8649` gives the appropriate data. "Get Fresh Data" will usually change which nodes are dead, and given a 30-minute cycle most will have switched.

2) Even though this has been running for many hours, some of the "alive" nodes report inaccuracies. One node, for example, shows:
"Last heartbeat received -209998 seconds ago"
"Uptime -975 days, 16:27:49"
"Swap: Using 0.0 of -100Mb"
"Booted: January 1, 1970"
The inaccuracies change every so often, and the node will report correctly for a while. Most of those I don't care about, but I think it may be a related problem.

3) The "dead" nodes are almost all spot on with their stats, and if you go to the node view and click "Get Fresh Data", the Load and CPU Utilization do update in sync even though the node is reported as dead.

Maybe I missed the keywords, but I was not able to find anything quite like this in the email archive. I would be very grateful if anyone has any clues as to what may be going on.

Thank you for your time,
Chris Stackpole
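
For reference, the `telnet computer 8649` check from problem 1 can be scripted. The following is only a minimal Python sketch, not part of Ganglia itself. The function names `fetch_gmond_xml` and `heartbeat_ages` are made up for illustration, and the sketch assumes that gmond writes its full XML dump on connect and that each HOST element carries NAME and REPORTED (Unix timestamp) attributes. A negative age would mean the reported timestamp is ahead of the clock on the machine running the script, which may be worth comparing against the negative heartbeat values in problem 2.

```python
import socket
import time
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read everything gmond sends on connect (the same data telnet shows)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    # Return raw bytes so the XML parser can honor the declared encoding.
    return b"".join(chunks)


def heartbeat_ages(xml_data, now=None):
    """Map each HOST name to seconds elapsed since its REPORTED timestamp.

    Assumes HOST elements have NAME and REPORTED attributes.
    """
    now = time.time() if now is None else now
    root = ET.fromstring(xml_data)
    ages = {}
    for host in root.iter("HOST"):
        ages[host.get("NAME")] = now - int(host.get("REPORTED"))
    return ages


if __name__ == "__main__":
    # "computer" is the placeholder host name used in the telnet example.
    for name, age in sorted(heartbeat_ages(fetch_gmond_xml("computer")).items()):
        flag = "  <-- timestamp ahead of local clock" if age < 0 else ""
        print(f"{name}: last reported {age:.0f} s ago{flag}")
```

Running this against a node that the web frontend marks as dead would show whether gmond itself considers the host's data fresh, independent of gmetad and the frontend.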