I am not sure if this is related, but ganglia doesn't seem to behave very well if a whole cluster either stops reporting or is removed. The RRDs no longer get updated, and gmetad still keeps a copy of the XML data from the last time it got it from the cluster, but the web frontend makes it appear as if the whole cluster is still up and running. It even says the last heartbeat was only a few seconds ago. The only clue that something is wrong is the empty graphs.
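To make the expected behavior concrete, here is a minimal sketch of the kind of staleness check I would have expected somewhere between gmetad and the frontend. It is not taken from the ganglia source: `CachedCluster`, `cluster_status`, `expire_stale`, and the `max_age` and `expire_after` thresholds are all hypothetical names and values chosen for illustration. The idea is simply to judge a cluster by the time of the last successful poll rather than by whatever heartbeat values are sitting in the cached XML.

```python
from dataclasses import dataclass


@dataclass
class CachedCluster:
    """Hypothetical record for one data source (not a real gmetad structure)."""
    name: str
    last_successful_poll: float  # Unix time of the last poll that returned fresh XML
    cached_xml: str              # the XML received at that time


def seconds_since_update(cluster: CachedCluster, now: float) -> float:
    """Age of the cached data, measured from the last successful poll."""
    return max(0.0, now - cluster.last_successful_poll)


def cluster_status(cluster: CachedCluster, now: float, max_age: float = 60.0) -> str:
    """Report 'up' only if fresh data arrived within max_age seconds (assumed threshold)."""
    return "up" if seconds_since_update(cluster, now) <= max_age else "stale"


def expire_stale(clusters, now: float, expire_after: float = 3600.0):
    """Keep only clusters whose cached data is at most expire_after seconds old."""
    return [c for c in clusters if seconds_since_update(c, now) <= expire_after]
```

With something like this, a cluster that was removed days ago would show up as "stale" (or be dropped entirely) instead of looking as if it had been heard from a few seconds ago. Whether such a check belongs in gmetad or in the frontend scripts is an open question.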
I first noticed this when we installed ganglia on a new cluster, then removed it a few days later. I expected the web frontend to show the entire cluster as dead, but it didn't. This could be dangerous, for example, if you are using ganglia to monitor your cluster and have some kind of network failure in the part of your cluster that is defined as the data_source for gmetad, or if those nodes simply die. Except for the graphs, it will still look like the cluster is up and reporting when it really isn't.

I haven't had the time to investigate this further, but there must be some sort of bug in the web frontend scripts that makes it appear that the nodes are still up and running, and were even heard from a few seconds ago. What about gmetad, though? Should it expire any of its data if it hasn't been updated after some time, or just keep it around, so that you have to manually restart it if you want to flush out the old cluster's data?

~Jason

On Wed, 2003-03-26 at 13:50, Steven Wagner wrote:
> matt massie wrote:
> > prashant-
> >
> > so when a node in the cluster dies the cluster size changes but the dead
> > node is not reported?
> >
> > this is a new problem that i haven't heard of before. did gmond get
> > restarted after the node failed? ganglia knows a node dies when it
> > stops getting heartbeats from a machine that it previously heard from. if
> > gmond is getting restarted somehow it wouldn't know about the dead node
> > because it hasn't even received a single heartbeat from it (remember that
> > everything in gmond is soft state).
> >
> > is it possible that your gmond data source was restarted after the node
> > died?
> >
> > i'm sure if we walk through this we'll find the solution to the problem.
>
> Now that I think about it, I seem to recall this happening to me in one of
> the recent (but not current) 2.5.x frontend revisions. There was a bug in
> (I believe) ganglia.php which was not incrementing the dead node array.
>
> I'm pretty sure the reason I didn't respond to the original message was
> that he's using the most current version and still gets the same behavior,
> so I was stumped. But I just had that idea again and decided to throw it
> out there in the hope of it being useful...
>
> And I know none of the regular readers of this list believe me, but I
> really *do* try not to go shooting off my mouth when I have no idea how to
> fix the problem... :)

--
/------------------------------------------------------------------\
|  Jason A. Smith                          Email: [EMAIL PROTECTED]|
|  Atlas Computing Facility, Bldg. 510M    Phone: (631)344-4226    |
|  Brookhaven National Lab, P.O. Box 5000  Fax:   (631)344-7616    |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/

