Hello,

Thanks a lot for your input! My problem appears to be fixed at the moment.
I'll comment on it where appropriate below.

On Wed, 28 Aug 2002, matt massie wrote:

> kevin-
>
> here is the reason why that is happening.
>
> when gmond gets a request for XML it walks its underlying cluster hash
> which is multithreaded with read/write locking.  it walks the hash as a
> reader.
>
> at the start of the XML output, it does a gettimeofday() and places a
> timestamp in the <CLUSTER LOCALTIME="x"> attribute.
>
> if by some chance a multicast packet comes in at the same time gmond is
> sending the data in the hash as XML, then the REPORTED timestamp for that
> HOST gets updated to have a value which is NEWER than the LOCALTIME.
>
> here is how to check if you have the problem...
>
> look in ./lib/gexec_funcs.c at line 64 ...
>
>     VVV
> if( abs(cluster->localtime - cluster->host->last_reported) < GEXEC_TIMEOUT )
>     ^^^
>             {
>                cluster->host_up = 1;
>             }
>          else
>             {
>                cluster->host_up = 0;
>             }
>
> it is important that the difference is being calculated as an absolute
> value since a negative number would make the host appear as if it were
> down when it's not.  if your subtraction is not inside of abs() then
> simply add it and you'll see your problem disappear.

This seemed to be the problem. The version of the monitoring core I was using
did not have the "abs" function on that line. I believe it was 2.4.0. I
updated to 2.4.1-1, where the "abs" is in gexec_funcs.c, and I have not
noticed the problem I reported since.
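
For anyone else who runs into this, here is a minimal sketch of why the
missing abs() matters. It is not the actual gexec_funcs.c code. It assumes the
timestamps are stored as unsigned integers, which is one way a "negative"
difference could end up looking like a huge one; the helper names and the
GEXEC_TIMEOUT value are made up for illustration. The guarded version compares
before subtracting rather than calling abs(), so the sketch does not depend on
how an unsigned value converts to int.

```c
#include <stdint.h>

/* Assumed value for illustration only; the real GEXEC_TIMEOUT may differ. */
#define GEXEC_TIMEOUT 60

/* Hypothetical helper: the check without any absolute value.
 * If last_reported is newer than local_time, the unsigned subtraction
 * wraps around to a very large number, so the host looks down. */
static int host_up_unguarded(uint32_t local_time, uint32_t last_reported)
{
    return (local_time - last_reported) < GEXEC_TIMEOUT;
}

/* Hypothetical helper: the same check on the absolute difference.
 * A last_reported that is slightly newer than local_time now counts
 * as a small difference, so the host is still reported as up. */
static int host_up_guarded(uint32_t local_time, uint32_t last_reported)
{
    uint32_t diff = (local_time >= last_reported)
                        ? local_time - last_reported
                        : last_reported - local_time;
    return diff < GEXEC_TIMEOUT;
}
```

With local_time = 1000 and last_reported = 1001 (a multicast packet arriving
just after the LOCALTIME stamp was taken), the unguarded check reports the
host as down and the guarded one reports it as up.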

In about 9 days' time, there have been no false reports.

Thank you! I very much appreciated your speedy response. Thanks for all the
great work with ganglia.

Kevin Flasch

> Today, Kevin James Flasch wrote forth saying...
>
> >
> > Hello,
> >
> > I've been having a problem with the gstat program that is part of the
> > Ganglia Monitoring Core. I have ganglia set up on a cluster of roughly 300
> > nodes. I was looking into the possibility of using ganglia for heartbeat
> > monitoring for the cluster (a way to be notified immediately if a node goes
> > down). Since ganglia routinely gathers information about its nodes, it
> > seemed reasonable to use.
> >
> > Checking the status of `gstat --dead` seems to give me the information I
> > want - it will tell me when hosts are down. However, it seems to have
> > *many* false positives. For example, I ran `gstat --dead` every minute for
> > about 18 hours and got 102 reports of machines down (many reports telling
> > me multiple machines were down). No machines went down during this time.
> > The cluster was not under what we would consider a considerable load,
> > either.
> >
> > Does anyone have any ideas why gstat is so unreliable? Is there some timeout
> > factor that might give more reasonable results?
> >
> > Thanks a lot for your time!
> >
> > Kevin Flasch

   Kevin James Flasch
   http://www.uwm.edu/~kflasch/kflasch.gpg


