If memory serves, the heartbeat metric was not added until midway through our long CVS-only push from 2.4.1 to 2.5.0. Before it existed, it was hard to be sure whether a node was down or had simply waited more than 20-30 seconds to transmit a metric. Gmetad *really* had a problem with this and often reported hosts as down that actually weren't (I only saw this on my Solaris cluster, when I was running 2.3.1b1 there).

It is also probably a good idea to add some sort of retry to your notifier. That will reduce your false positives, at the cost of being notified after 40-60 seconds instead of 20-40 ... worth it if it means your two-way pager doesn't go off at 2am!
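
As a rough illustration of the retry idea, here is a minimal Python sketch; it is not anything shipped with Ganglia. `get_dead_hosts` is a hypothetical callable you would write yourself (for example, by parsing whatever `gstat --dead` prints on your install), and the default of 2 checks spaced 20 seconds apart is only an assumption to match the timings above. The function reports only the hosts that show up as dead on every consecutive poll.

```python
import time
from typing import Callable, Set


def confirmed_dead(
    get_dead_hosts: Callable[[], Set[str]],  # hypothetical: returns hosts currently reported dead
    checks: int = 2,                         # assumed default: require two polls in a row
    interval: float = 20.0,                  # assumed default: seconds between polls
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Return only the hosts reported dead on every one of `checks` polls."""
    if checks < 1:
        raise ValueError("checks must be at least 1")
    dead = set(get_dead_hosts())
    for _ in range(checks - 1):
        if not dead:
            break  # nothing left to confirm, skip the remaining polls
        sleep(interval)
        # Keep a host only if it is still reported dead on this poll too.
        dead &= set(get_dead_hosts())
    return dead
```

A notifier would page only for the hosts this returns, so a node that merely transmitted late on one poll never reaches the pager.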

2.5.0 is coming Real Soon Now, and is an improvement. The CVS version appears to be stable so far.

An alternative to going CVS (I rashly assume you haven't :) ) would be altering the gstat logic to increase the "dead threshold," which is probably set to something like 20 seconds. Triple it and see what happens...
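
To picture what raising the threshold does, here is a tiny Python sketch of the general idea. It is not gstat's actual code: the function name, its arguments, and the 20-second default are all illustrative assumptions (the 20 seconds is just my guess from above).

```python
def is_dead(last_heard: float, now: float, threshold: float = 20.0) -> bool:
    """Treat a host as dead once it has been silent for longer than `threshold` seconds."""
    return (now - last_heard) > threshold
```

With these assumed numbers, a node last heard from 25 seconds ago counts as dead at a 20-second threshold but as alive at 60.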

Hope this helps...

Kevin James Flasch wrote:
Hello,

I've been having a problem with the gstat program that is part of the Ganglia
Monitoring Core. I have ganglia set up on a cluster of roughly 300 nodes. I
was looking into the possibility of using ganglia for heartbeat monitoring for
the cluster (a way to be notified immediately if a node goes down). Since
ganglia routinely gathers information about its nodes, it seemed reasonable
to use.

Checking the status of `gstat --dead` seems to give me the information I want
- it will tell me when hosts are down. However, it seems to have *many* false
positives. For example, I ran `gstat --dead` every minute for about 18 hours
and got 102 reports of machines down (many reports telling me multiple
machines were down). No machines went down during this time. The cluster was
not under what we would consider a considerable load, either.

Does anyone have any ideas why gstat is so unreliable? Is there some timeout
factor that might give more reasonable results?

Thanks a lot for your time!

Kevin Flasch


