leif-
i've been wanting to implement an active alerting mechanism for a while. the
development team would love some help if you're willing to donate a little
time.
i have an idea for a quick and smart hack (i think). gmetad is already
doing the hardest part of this work.
here's the trick...
on the machine running gmetad you'll find all the round-robin databases in
/var/lib/ganglia/rrds (by default) in a nice hierarchy which can be used
to query the information that gmetad has stored. the hierarchy looks like
this...
- root (most likely /var/lib/ganglia/rrds)
  |
  +-- __SummaryInfo__
  |     |
  |     + Metric foo.rrd
  |     + Metric bar.rrd
  |     ...
  |
  +-- Cluster 1
  |     |
  |     +---- __SummaryInfo__
  |     |       |
  |     |       + Metric foo.rrd
  |     |       + Metric bar.rrd
  |     |       ...
  |     |
  |     +---- Host a on Cluster 1
  |     |       |
  |     |       + Metric foo.rrd
  |     |       + Metric bar.rrd
  |     |       ...
  |     |
  |     +---- Host b on Cluster 1
  |             |
  |             + Metric foo.rrd
  |             + Metric bar.rrd
  |             ...
  |
  +-- Cluster 2
  ... etc etc etc
the "__SummaryInfo__" directory are your friend because they contain the
summary information for each metric and each level (grid, cluster and
host).
if you just do a "find ." in /var/lib/ganglia/rrds you'll see what i mean.
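for instance, a rough python sketch that walks the tree and lists what it
finds might look something like this (the root path below is just the gmetad
default; adjust to taste)...

import os

RRD_ROOT = "/var/lib/ganglia/rrds"      # gmetad default; change if yours differs

for dirpath, dirnames, filenames in os.walk(RRD_ROOT):
    rrds = [f for f in filenames if f.endswith(".rrd")]
    if not rrds:
        continue
    # dirpath is either <root>/__SummaryInfo__ (grid summary),
    # <root>/<cluster>/__SummaryInfo__ (cluster summary),
    # or <root>/<cluster>/<host> (per-host metrics)
    print("%s: %d metrics" % (os.path.relpath(dirpath, RRD_ROOT), len(rrds)))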
how do you get the data out of the round-robin databases? using rrdtool.
here is a walk-through on one of the Millennium monitoring machines...
# cd /var/lib/ganglia/rrds
# ls
CITRIS Pilot Cluster
Millennium Cluster
OceanStore
WebServer Cluster
__SummaryInfo__
# cd Citris\ Pilot\ Cluster/
# ls
grapefruit.Millennium.berkeley.edu
lime.Millennium.Berkeley.EDU
lemon.Millennium.berkeley.edu
orange.Millennium.berkeley.edu
__SummaryInfo__
# cd orange.Millennium.berkeley.edu
# ls
bytes_in.rrd    cpu_nice.rrd    disk_free.rrd     load_one.rrd     mem_shared.rrd     pkts_out.rrd
bytes_out.rrd   cpu_num.rrd     disk_total.rrd    mem_buffers.rrd  mem_total.rrd      proc_run.rrd
cpu_aidle.rrd   cpu_system.rrd  load_fifteen.rrd  mem_cached.rrd   part_max_used.rrd  proc_total.rrd
cpu_idle.rrd    cpu_user.rrd    load_five.rrd     mem_free.rrd     pkts_in.rrd        swap_free.rrd
say we want to monitor a specific metric (say cpu_user) on this specific
host (orange). to get the data, all we have to do is use rrdtool:
# date '+now is %s'
now is 1034101015
# rrdtool fetch ./cpu_user.rrd AVERAGE -s N-60
sum
1034100945: 0.0000000000e+00
1034100960: 3.7333333333e-01
1034100975: 7.0000000000e-01
1034100990: 7.0000000000e-01
1034101005: 7.0000000000e-01
1034101020: nan
the first command is just to let you see the timestamp of when i ran this.
the rrdtool command is simple and gives you a nice table of recent values
("N-60" means now minus 60 seconds, so the data covers the last 60 seconds).
the first column is the timestamp when the data was put into the database
and the second column (after the ':' delimiter) is the value that was
inserted. the Data Source (DS) name is "sum", which you see at the top.
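if you want to do this from a script instead of by hand, wrapping the fetch
is easy enough. here's a rough python sketch (the fetch_values name and the
hard-coded path are just for illustration)...

import subprocess

def fetch_values(rrd_path, seconds=60):
    """run 'rrdtool fetch' and return a list of (timestamp, [values]) rows."""
    out = subprocess.check_output(
        ["rrdtool", "fetch", rrd_path, "AVERAGE", "-s", "N-%d" % seconds])
    rows = []
    for line in out.decode().splitlines():
        if ":" not in line:
            continue                      # skip the DS-name header and blanks
        ts, rest = line.split(":", 1)
        if not ts.strip().isdigit():
            continue
        rows.append((int(ts), [float(v) for v in rest.split()]))
    return rows

# e.g. the last minute of cpu_user on orange
rows = fetch_values("/var/lib/ganglia/rrds/Citris Pilot Cluster/"
                    "orange.Millennium.berkeley.edu/cpu_user.rrd")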
important note:
the __SummaryInfo__ databases have 2 Data Sources, "sum" and "num". the
"num" data source is the number of hosts that were added together to get
the "sum". it lets you easily compute averages (just divide sum by num).
let's open up a __SummaryInfo__ database now...
# cd ..
# pwd
/var/lib/ganglia/rrds/Citris Pilot Cluster
# ls
grapefruit.Millennium.berkeley.edu  lime.Millennium.Berkeley.EDU    __SummaryInfo__
lemon.Millennium.berkeley.edu       orange.Millennium.berkeley.edu
# cd __SummaryInfo__
# date '+now is %s'
now is 1034101477
# rrdtool fetch ./cpu_user.rrd AVERAGE -s N-60
sum num
1034101410: 1.3000000000e+00 3.0000000000e+00
1034101425: 9.0000000000e-01 3.0000000000e+00
1034101440: 1.0000000000e-01 3.0000000000e+00
1034101455: 2.5000000000e+00 3.0000000000e+00
1034101470: 2.5000000000e+00 3.0000000000e+00
1034101485: nan nan
the command line for getting the data is exactly the same, but you get back
a second column which (in this case) tells us that the values from three
hosts were added together to get the "sum".
make sense?
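so a cluster-wide average is just one division away. a sketch reusing the
hypothetical fetch_values() helper from above...

# cluster-wide average cpu_user over the last minute
rows = fetch_values("/var/lib/ganglia/rrds/Citris Pilot Cluster/"
                    "__SummaryInfo__/cpu_user.rrd")
for ts, (total, nhosts) in rows:
    if nhosts > 0:                        # the trailing nan row fails this test
        print("%d: average cpu_user = %.2f" % (ts, total / nhosts))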
the nice thing about using the round-robin databases is that you have a
strict hierarchical directory structure which allows you to key in on the
specific things you want to monitor. it's also a good solution because
gmetad has done all the summary work for you.
you could write an alert system using simple scripting (bourne, perl,
python, et al).
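to make that concrete, a bare-bones threshold checker (run from cron, say)
could be as simple as the following python sketch. the metric names, limits
and notification hook here are completely made up, and it leans on the
fetch_values() helper sketched earlier...

import os

RRD_ROOT = "/var/lib/ganglia/rrds"
THRESHOLDS = {"cpu_user": 90.0, "load_one": 8.0}   # illustrative limits only

def latest_average(rrd_path):
    """latest non-nan (sum / num) value from a __SummaryInfo__ rrd."""
    for ts, (total, nhosts) in reversed(fetch_values(rrd_path)):
        if nhosts > 0:                             # skip nan rows
            return total / nhosts
    return None

for cluster in os.listdir(RRD_ROOT):
    summary = os.path.join(RRD_ROOT, cluster, "__SummaryInfo__")
    if not os.path.isdir(summary):
        continue
    for metric, limit in THRESHOLDS.items():
        rrd = os.path.join(summary, metric + ".rrd")
        if not os.path.exists(rrd):
            continue
        value = latest_average(rrd)
        if value is not None and value > limit:
            print("ALERT: %s %s = %.2f (limit %.2f)"
                  % (cluster, metric, value, limit))
            # hook in mail, paging, whatever here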
i would like someone to step up to the plate on this. it's a great feature
that is just dying to be born. just let me know if you (plural... meaning
Leif or someone on the development team... or both) want to take ownership
of this. i'll help all i can... we have a great group of developers too who
would be invaluable in building this.
btw, i'm making good progress building the hierarchical tree data
structure which will be the core of ganglia 3. it's basically going to be
a super-fast in-memory file system with read-write locking for
concurrency. it also has arbitrary depth so we can be creative with the
namespace.
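(to give a rough idea of the shape of it, here's a toy python sketch of the
concept only. the real thing is C, and it uses proper read-write locks where
this stand-in just uses a plain mutex per node...)

import threading

class Node(object):
    """one entry in an in-memory, filesystem-like tree."""
    def __init__(self, name, value=None):
        self.name = name
        self.value = value
        self.children = {}               # name -> Node, arbitrary depth
        self.lock = threading.Lock()     # real version: reader-writer lock

    def insert(self, path, value):
        node = self
        for part in path.strip("/").split("/"):
            with node.lock:
                node = node.children.setdefault(part, Node(part))
        with node.lock:
            node.value = value

    def lookup(self, path):
        node = self
        for part in path.strip("/").split("/"):
            with node.lock:
                node = node.children.get(part)
            if node is None:
                return None
        return node

root = Node("/")
root.insert("Cluster 1/orange/cpu_user", 0.7)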
-matt
Today, Leif Nixon wrote forth saying...
> Steven Wagner <[EMAIL PROTECTED]> writes:
>
> > Leif Nixon wrote:
> > > Steven Wagner <[EMAIL PROTECTED]> writes:
> > > Yes, that's what I did last week. It ain't no fun. Nagios' handling
> > > of passive service checks isn't flexible enough. And passive host
> > > checking Just Isn't Done.
> >
> > Once again, considering you have the source at your disposal, I'm sure
> > you could work something out. Spackling in passive host checking is
> > easier than some of the alternatives. :)
>
> I'm not sure about that. Cue Ethan Galstad, Nagios' creator:
>
> "I am investigating the possibility of adding passive host checks in
> 2.0. However, allowing passive checks opens a whole new can of worms
> as far as host check logic is concerned. For instance, if a host is
> reported (passively) as being down (it was previously up), what
> should happen with child hosts? Should those be actively checked
> according to the current tree traversal logic? Also, host checks are
> performed on-demand only (synchronously), so how do you handle
> asynchronous results? Host checks also get priority over pending
> passive service check results, so that has to be figured out.
>
> Anyway, it isn't exactly trivial without changing a good portion of
> how the host check logic works. I'll be looking into it though..."
>
> I don't think I want to dive that deep into Nagios just to make it do
> something it really isn't designed to do.
>
> > > Well, each metric could certainly come with default thresholds,
> > > and if you use some inheritance mechanism you could rather easily
> > > specify thresholds for all your cluster nodes:
> >
> > In a per-node model you have to distribute the new config file to n
> > nodes every time you change something. Which is kind of a bummer,
> > since (as I mentioned before) it seems that there's always an initial
> > tweaking period with notifying mechanisms where you're changing the
> > config every five minutes.
>
> I'm not sure I see the point in distributing the threshold information.
> As you said, the actual notifications will be issued from a central
> host, so why not just keep the threshold configuration there?
>
> > > That way, you only need to specify any exceptions from the defaults.
> > > Whooshy enough?
> >
> > The mental image I was actually going for was the loading program from
> > The Matrix, substituting endless streams of configuration directives
> > for racks o' firearms...
>
> Yes, obviously. So I showed how a few configuration lines (cut to
> Tank, rapidly typing) could specify load thresholds for an entire
> metacluster (WHOOSH). 8^)
>
> > *That* stuff needs to be in gmetad (or a program that fills the same
> > niche, querying one or more metadaemons or monitoring cores, chewing
> > on the XML data, and doing something with it). Flap thresholds,
> > contact info, etc., etc., etc. ...
> >
> > Sounds like you're volunteering to write it. :P
>
> Here I was, hoping I could inspire someone else. 8^)