Hello, I've been having a problem with the gstat program that is part of the Ganglia Monitoring Core. I have Ganglia set up on a cluster of roughly 300 nodes, and I was looking into using it for heartbeat monitoring (a way to be notified immediately if a node goes down). Since Ganglia routinely gathers information about its nodes, it seemed like a reasonable fit.
Checking the output of `gstat --dead` seems to give me the information I want: it tells me when hosts are down. However, it produces *many* false positives. For example, I ran `gstat --dead` every minute for about 18 hours and got 102 reports of machines being down, many of them listing multiple machines. No machines actually went down during this time, and the cluster was not under what we would consider a heavy load.

Does anyone have any ideas why gstat is so unreliable? Is there some timeout setting that might give more reasonable results?

Thanks a lot for your time!

Kevin Flasch
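
P.S. To make the question about a timeout concrete, here is a minimal sketch of one workaround on the polling side: only treat a host as down once it has shown up in several consecutive `gstat --dead` polls. This is an assumption about what might help, not something Ganglia provides. The function name `confirmed_dead`, the `threshold` parameter, and the host names are hypothetical, and parsing the actual `gstat --dead` output into sets of host names is left out because its format is not shown here.

```python
def confirmed_dead(polls, threshold=3):
    """Return (poll_index, host) pairs for hosts reported dead in
    `threshold` consecutive polls.

    `polls` is a sequence of sets, one per poll, each holding the host
    names that a single `gstat --dead` run reported as dead (hypothetical
    input; the parsing step is not shown).
    """
    streak = {}   # host -> number of consecutive polls it was reported dead
    alerts = []
    for index, dead in enumerate(polls):
        # Rebuilding the dict drops every host missing from this poll,
        # which resets its streak to zero.
        streak = {host: streak.get(host, 0) + 1 for host in dead}
        for host in sorted(dead):
            # Alert exactly once, at the poll where the streak reaches the threshold.
            if streak[host] == threshold:
                alerts.append((index, host))
    return alerts


if __name__ == "__main__":
    # Hypothetical one-minute polls: node1 flaps, node2 stays reported dead.
    polls = [{"node1"}, set(), {"node1", "node2"}, {"node2"}, {"node2"}]
    print(confirmed_dead(polls, threshold=3))
```

With one-minute polling and `threshold=3`, a real failure would be reported about two minutes later than with a single poll, so this trades detection latency for fewer spurious reports. It does not explain why the spurious reports happen in the first place.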

