"Good morning,
Afraid this is going to require a little back story. We were interested in 
using Ganglia to monitor a few priority systems. Because we were running Debian 
at the time we just used the 2.5.7-2 Ganglia that is in Debian Sarge (stable) 
apt repository. The project grew to include many other systems. When we had a 
hardware failure, we quickly moved the Ganglia monitor to a Mandrake 10 box 
about a week ago. The project is much larger now and I had all sorts of 
problems. I went through the email list and fixed almost all of them, but one 
recurring theme was that I really should update to a newer version, as there are 
many more updates, changes, and fixes. So I did. Probably not the best choice, 
but what's done is done. While it is important to have this running soon, I do 
have time to rebuild if absolutely necessary.

The Current setup.
The system that is running the Ganglia monitor is a Mandrake 10.1 box: Gmetad 
Web Frontend v2.5.7, Gmetad Web Backend v3.0.3.
The nodes are all running Gmond 3.0.0 (a few of the computers are 
Windows and there are many flavors of Linux, so I updated to this version across 
the board in hopes of keeping things on their end as close as possible).

The Problems.
1) Now at any given time at least half of the computers are reported as dead, 
even though they are not. Doing a `telnet computer 8649` gives the appropriate 
data. "Get Fresh Data" will usually change which nodes are shown as dead, and 
over a 30-minute cycle most will have switched.

2) Even though this has been running for many hours, some of the "alive" nodes 
report inaccurate values. One node, for example, shows:
"Last heartbeat received -209998 seconds ago"
"Uptime -975 days, 16:27:49"
"Swap: Using 0.0 of -100Mb"
"Booted: January 1, 1970"
The inaccuracies change every so often and it will report correctly for a 
while. Most of those I don't care about but I think it may be a related problem.

3) The "dead" nodes are almost all spot on with their stats, and if you go to 
the node view and click "Get Fresh Data", the Load and CPU Utilization do 
update in sync even though the node is reported as dead.
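
The symptoms above (nodes marked dead while still answering on 8649, and a negative "Last heartbeat") can be cross-checked by reading the same XML that `telnet computer 8649` shows and comparing timestamps. Below is a minimal Python sketch, not a definitive diagnosis. It assumes the gmond XML carries a LOCALTIME attribute on each CLUSTER element and NAME and REPORTED attributes on each HOST element, and that a heartbeat age is roughly LOCALTIME minus REPORTED. The function names and the 80-second threshold are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump gmond sends on connect (same data telnet shows)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def heartbeat_ages(xml_data):
    """Return {host name: LOCALTIME - REPORTED in seconds} for every HOST.

    Assumption: LOCALTIME (on CLUSTER) and REPORTED (on HOST) are Unix
    timestamps. A negative age means the report is stamped in the future
    relative to the daemon that answered.
    """
    root = ET.fromstring(xml_data)
    ages = {}
    for cluster in root.iter("CLUSTER"):
        localtime = int(cluster.get("LOCALTIME", "0"))
        for host in cluster.iter("HOST"):
            reported = int(host.get("REPORTED", "0"))
            ages[host.get("NAME")] = localtime - reported
    return ages


def suspicious_hosts(ages, max_age=80):
    """Names whose age is negative or older than max_age (illustrative value)."""
    return sorted(name for name, age in ages.items()
                  if age < 0 or age > max_age)
```

For example, `suspicious_hosts(heartbeat_ages(fetch_gmond_xml("computer")))` would list the hosts whose timestamps look wrong from that node's point of view. Running it against several nodes and against the monitor box shows whether they disagree with each other.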


Maybe I missed the keywords, but I was not able to find anything quite like 
this in the email archive. I would be very grateful if anyone has any clues as 
to what may be going on.

Thank you for your time,
Chris Stackpole
