kevin-
here is the reason why that is happening.
when gmond gets a request for XML, it walks its underlying cluster hash,
which is shared between threads and protected by read/write locking. it
walks the hash as a reader.
at the start of the XML output, it does a gettimeofday() and places a
timestamp in the <CLUSTER LOCALTIME="x"> attribute.
if by some chance a multicast packet comes in at the same time gmond is
sending the data in the hash as XML, then the REPORTED timestamp for that
HOST gets updated to have a value which is NEWER than the LOCALTIME.
here is how to check if you have the problem...
look in ./lib/gexec_funcs.c at line 64 ...
    VVV
if( abs(cluster->localtime - cluster->host->last_reported) < GEXEC_TIMEOUT )
    ^^^
{
    cluster->host_up = 1;
}
else
{
    cluster->host_up = 0;
}
it is important that the difference is calculated as an absolute
value, since a negative number would make the host appear as if it were
down when it's not. if your subtraction is not inside of abs() then
simply add it and you'll see your problem disappear.
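to see why the abs() matters, here is a small standalone sketch. it is not the
ganglia source. it assumes the two timestamps are held in unsigned variables
(the declarations aren't shown above, so treat that as an assumption), in which
case a REPORTED value newer than LOCALTIME makes the subtraction wrap around to
a huge number, and the host looks long dead. EXAMPLE_TIMEOUT, host_up_naive(),
abs_diff(), host_up_abs() and check_example() are made-up names for
illustration only. abs_diff() compares before subtracting instead of calling
abs(), which gives the same absolute difference without leaning on a
signed conversion.

```c
#include <assert.h>

/* Hypothetical stand-in for GEXEC_TIMEOUT; the real value is in the ganglia source. */
#define EXAMPLE_TIMEOUT 60UL

/* Naive check. ASSUMPTION: timestamps are unsigned, so when last_reported is
 * newer than localtime_s the subtraction wraps to a very large value. */
int host_up_naive(unsigned long localtime_s, unsigned long last_reported,
                  unsigned long timeout)
{
    return (localtime_s - last_reported) < timeout;
}

/* Absolute difference of two unsigned timestamps, computed without wrapping. */
unsigned long abs_diff(unsigned long a, unsigned long b)
{
    return a > b ? a - b : b - a;
}

/* Same check, but on the absolute difference. */
int host_up_abs(unsigned long localtime_s, unsigned long last_reported,
                unsigned long timeout)
{
    return abs_diff(localtime_s, last_reported) < timeout;
}

/* Self-check covering the race case and two ordinary cases. */
void check_example(void)
{
    /* REPORTED is one second newer than LOCALTIME (the race described above). */
    assert(host_up_naive(1000UL, 1001UL, EXAMPLE_TIMEOUT) == 0); /* false "down" */
    assert(host_up_abs(1000UL, 1001UL, EXAMPLE_TIMEOUT) == 1);   /* correctly up */

    /* Host reported 10 seconds ago: both checks agree it is up. */
    assert(host_up_naive(1000UL, 990UL, EXAMPLE_TIMEOUT) == 1);
    assert(host_up_abs(1000UL, 990UL, EXAMPLE_TIMEOUT) == 1);

    /* Host last reported 100 seconds ago: both checks agree it is down. */
    assert(host_up_naive(1000UL, 900UL, EXAMPLE_TIMEOUT) == 0);
    assert(host_up_abs(1000UL, 900UL, EXAMPLE_TIMEOUT) == 0);
}
```

calling check_example() from a main() should pass silently. the first pair of
asserts is the interesting one: a one-second skew is enough to flip the naive
check to "down".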
good luck!
-matt
Today, Kevin James Flasch wrote forth saying...
>
> Hello,
>
> I've been having a problem with the gstat program that is part of the Ganglia
> Monitoring Core. I have ganglia set up on a cluster of roughly 300 nodes. I
> was looking into the possibility of using ganglia for heartbeat monitoring for
> the cluster (a way to be notified immediately if a node goes down). Since
> ganglia routinely gathers information about its nodes, it seemed reasonable
> to use.
>
> Checking the status of `gstat --dead` seems to give me the information I want
> - it will tell me when hosts are down. However, it seems to have *many* false
> positives. For example, I ran `gstat --dead` every minute for about 18 hours
> and got 102 reports of machines down (many reports telling me multiple
> machines were down). No machines went down during this time. The cluster was
> not under what we would consider a considerable load, either.
>
> Does anyone have any ideas why gstat is so unreliable? Is there some timeout
> factor that might give more reasonable results?
>
> Thanks a lot for your time!
>
> Kevin Flasch