Hello,

I've been having a problem with the gstat program that is part of the Ganglia
Monitoring Core. I have Ganglia set up on a cluster of roughly 300 nodes. I
was looking into the possibility of using Ganglia for heartbeat monitoring of
the cluster (a way to be notified immediately if a node goes down). Since
Ganglia routinely gathers information about its nodes, it seemed reasonable
to use.

Checking the status of `gstat --dead` seems to give me the information I want
- it will tell me when hosts are down. However, it seems to have *many* false
positives. For example, I ran `gstat --dead` every minute for about 18 hours
and got 102 reports of machines down (many reports telling me multiple
machines were down). No machines actually went down during this time. The
cluster was not under what we would consider a heavy load, either.

Does anyone have any ideas why gstat is so unreliable? Is there some timeout
factor that might give more reasonable results?
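
One workaround I have considered is to alert only when a host is reported dead
in several consecutive polls, on the guess that the false positives are
transient. Below is a minimal sketch of that idea, not something gstat
provides. It assumes the host names have already been extracted from each
`gstat --dead` run (I am not relying on any particular output format here), and
the function name `confirmed_dead` and the `threshold` parameter are
hypothetical:

```python
from collections import defaultdict


def confirmed_dead(polls, threshold=3):
    """Return hosts reported dead in at least `threshold` consecutive polls.

    `polls` is an iterable of sets of host names, one set per poll,
    oldest first. How those sets are obtained from `gstat --dead` is
    left to the caller (assumption: parsing happens elsewhere).
    """
    streak = defaultdict(int)  # host -> current run of consecutive "dead" reports
    confirmed = set()
    for dead_now in polls:
        # A host missing from this poll has recovered: reset its streak.
        for host in list(streak):
            if host not in dead_now:
                del streak[host]
        # Extend the streak of every host reported dead in this poll.
        for host in dead_now:
            streak[host] += 1
            if streak[host] >= threshold:
                confirmed.add(host)
    return confirmed
```

With a one-minute polling interval and `threshold=3`, a host would have to be
reported dead for about three minutes in a row before an alert fires, which
trades some notification delay for fewer spurious reports.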

Thanks a lot for your time!

Kevin Flasch

