IMO, if you are *really* super-concerned with data integrity for some sort of alert system, bolting it onto gmetad doesn't seem like the best solution. For starters, gmetad's XML snapshot is the only thing that's current and easily accessible. If you want more reliability, have it query a cluster node, bootstrap a node list off of that, and poll the monitoring cores. Or, better yet, don't poll the monitoring cores at all: listen for the XDR packets and maintain your own copy of the data in whatever structure suits your purposes.
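As a rough sketch of the listen-only approach, here's what joining the gmond multicast channel and keeping your own latest-value table might look like. The group/port are gmond's usual defaults, but check your gmond.conf; actually decoding the payload needs the gmond XDR protocol definitions, which are omitted here.

```python
import socket
import struct

# gmond's usual default multicast channel; adjust to match your gmond.conf
GROUP, PORT = "239.2.11.71", 8649

def membership_request(group):
    """Pack an ip_mreq struct: multicast group address + INADDR_ANY
    (join on whatever interface the kernel picks)."""
    return struct.pack("=4sl", socket.inet_aton(group), socket.INADDR_ANY)

def open_xdr_listener(group=GROUP, port=PORT):
    """Join the gmond multicast channel and return a bound UDP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock

def listen(sock, state):
    """Maintain our own copy of the data: latest raw XDR payload per
    sender. Decode each payload with the gmond XDR definitions to get
    (metric name, value) pairs for the notifier's condition tests."""
    while True:
        data, (host, _port) = sock.recvfrom(8192)
        state[host] = data
```

Since every monitoring core re-announces its metrics on the channel, this copy stays as fresh as the cluster itself, with no polling at all.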

I think that's the only way you'd be assured of having the most current information. Remember that a monitoring core never announces that it's going down; it just stops responding. gmetad polls, parses, and writes RRDs every 20 (+/- a random value) seconds. If your notifier polls gmetad and touches RRDs at a similar rate, you're doubling the load on the front-end/notifier box.

For a notifier more sophisticated than "host up/down" and basic threshold-type stuff, I would personally not feel comfortable relying on gmetad. But maybe that's because I've been hammering on this thing since it was written in perl. :P

If you enter the way-back machine and read some of the old comments on this list you can pretty much guess what my ideal notifier looks like. It looks like a listen-only monitoring core, and its only mandatory config directives are of the "what scripts do I run/people do I e-mail for which events" variety. The supporting values for each metric should be enough for a notifier agent to have sensible threshold defaults for everything, without tinkering with an additional config file for it. Although that could of course be an option.
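To make that concrete, here's a purely hypothetical sketch of what such a notifier's config could look like — every directive name below is invented, and nothing like this exists today:

```
# Hypothetical notifier.conf -- all directive names are made up
notify "load_one"   above 8.0 for 120s   run  "/usr/local/bin/page-oncall"
notify "disk_error" on any event         mail "[email protected]"
notify "heartbeat"  missing for 60s      run  "/usr/local/bin/mark-host-down"
```

The point is that thresholds and units could default from the metric metadata the cores already announce, so the only thing you *have* to configure is who gets told about what.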

Hmmm, if we were building a big bad servlet, that could be just another thread.

Hmmmmmmmm...

[EMAIL PROTECTED] wrote:
Federico, thanks again for your insight. My question about gmetad being up to date became irrelevant in the context of your answer about gmetad not catching up. Since it only ever pulls the latest cluster state and doesn't try to fill in the gaps, it is always up to date, as you pointed out.

Has any thought been given to making the tracking-over-time feature robust?

I'm asking in the context of a project we're considering to bolt an alerting daemon onto Ganglia. Originally I thought it would make sense to build it like the web front end: read the gmetad database and then scan it for problems. Then I realized that certain kinds of problems, like disk errors, are transient. So the case that I'm concerned about is that a disk error occurs at a moment when gmetad isn't pulling cluster state, and so it never gets noticed. This approach seems to work well for metrics that aren't transient, like application or node status. But for metrics that are essentially events, another collection method probably makes more sense.

We could build an alert engine like gmetad itself. Pull just the latest cluster state from some cooperating gmond and apply condition tests to that. Any comment on which approach is better?
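A minimal sketch of that gmetad-like approach — pull the latest cluster state from a cooperating gmond's XML port (TCP 8649 by default) and run simple condition tests over it. The threshold rules and element/attribute handling here are assumptions kept deliberately simple; real gmond XML carries more attributes (units, TN, TMAX, etc.) you would want to use.

```python
import socket
import xml.etree.ElementTree as ET

def fetch_cluster_state(host, port=8649):
    """Pull the current cluster snapshot from a gmond's XML port.
    gmond dumps its XML and closes the connection, so read until EOF."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as s:
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ISO-8859-1")

def check_thresholds(xml_text, rules):
    """rules: {metric_name: upper_limit}. Returns (host, metric, value)
    triples for every metric above its limit -- a bare-bones condition
    test, with no notion of transient events."""
    alerts = []
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            name = metric.get("NAME")
            if name in rules and float(metric.get("VAL")) > rules[name]:
                alerts.append((host.get("NAME"), name, float(metric.get("VAL"))))
    return alerts
```

Note this shares gmetad's blind spot from the paragraph above: anything that happens between two polls is invisible, which is exactly why transient events need the listening approach instead.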

Is there an open source implementation of LogCaster or TNT Elm type functionality for Linux? How do you find out when a disk error occurs?

Jonathan

-----Original Message-----
From: Federico Sacerdoti [mailto:[EMAIL PROTECTED]
Sent: Friday, December 06, 2002 1:22 PM
To: [EMAIL PROTECTED]
Cc: [email protected]
Subject: Re: [Ganglia-developers] One more question


I'll try to answer all of these.

On Thursday, December 5, 2002, at 09:23 PM, [EMAIL PROTECTED]
wrote:

 > Federico & Steven, I really appreciate your thoughts about the
 > Ganglia front-end architecture.
 >
 > I have one more question. Is gmetad robust? If I've got this right,
 > gmond maintains only the latest metric values received for the
 > cluster. If all of the gmetads go down, aren't all the values during
 > that time period lost forever?

If a gmetad goes down, it stops recording metric value history. When it
comes up, this will show as a gap in the graphs.

 > If at least one gmetad stays up, then when others come up and pull the
 > xml description from the gmetad that survived, will they merge all of
 > the values missing from their own rrd?

This does not happen. Gmetads are not robust the way gmonds are. They
do not attempt to "bring newcomers up to date" as gmond does. This has
to do with security: how do we know you deserve the old data? With
gmond, the security is implicit in being part of the multicast channel.

Also, the rrds are very timestamp-sensitive. Even if we did give a
recovering gmetad data for its gaps, small clock skews would make the
graphs look terrible. Not that we couldn't overcome this with careful
engineering. Our assumption is that gmetads are running on dedicated
monitoring hardware that is hand-administered and possibly redundant.
If a gmetad goes down, an operator can copy the rrd files from a
surviving gmetad to fill in the gaps. However, in practice, gaps are
not that big of a deal, and don't degrade performance or correctness
the way a gmond failure does.

 > If so, how do you know at any given time whether a particular gmetad
 > is up to date?

A gmetad always makes graphs based on fresh data. If it is drawing
anything on the left side of a graph, it is up to date. Otherwise it is
dead. If there are gaps in the graph, it means the gmetad was down for
that period of history. I may be misunderstanding your question here.

 > What advice would you give in terms of the gmetad to gmond ratio? For
 > maximum redundancy, should every node run both gmond and gmetad?

Since keeping metric history with RRD databases is computationally and
I/O intensive, I would not suggest this. We keep a gmetad service
running on the frontend node of a cluster; that is, one gmetad per
cluster.
 >
 > Jonathan
 >
 >

Hope this helps,
Federico

Rocks Cluster Group, Camp X-Ray, SDSC, San Diego
GPG Fingerprint: 3C5E 47E7 BDF8 C14E ED92  92BB BA86 B2E6 0390 8845
