leif-
i've been wanting to implement an active alerting mechanism for a while. the
development team would love some help if you're willing to donate a little
time.
i have an idea for a quick and smart hack (i think). gmetad is already
doing the hardest part of this work.
here's the trick...
on the machine running gmetad you'll find all the round-robin databases in
/var/lib/ganglia/rrds (by default) in a nice hierarchy which can be used
to query the information that gmetad has stored. the hierarchy looks like
this...
- root (most likely /var/lib/ganglia/rrds)
  |
  +-- __SummaryInfo__
  |     |
  |     + Metric foo.rrd
  |     + Metric bar.rrd
  |     ...
  |
  +-- Cluster 1
  |     |
  |     +---- __SummaryInfo__
  |     |       |
  |     |       + Metric foo.rrd
  |     |       + Metric bar.rrd
  |     |       ...
  |     |
  |     +---- Host a on Cluster 1
  |     |       |
  |     |       + Metric foo.rrd
  |     |       + Metric bar.rrd
  |     |       ...
  |     |
  |     +---- Host b on Cluster 1
  |             |
  |             + Metric foo.rrd
  |             + Metric bar.rrd
  |             ...
  |
  +-- Cluster 2
  ... etc etc etc
the "__SummaryInfo__" directory are your friend because they contain the
summary information for each metric and each level (grid, cluster and
host).
if you just do a "find ." in /var/lib/ganglia/rrds you'll see what i mean.
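for instance, a rough python sketch that walks the tree and lists what it
finds might look something like this (the root path below is just the gmetad
default; adjust to taste)...

import os

RRD_ROOT = "/var/lib/ganglia/rrds"      # gmetad default; change if yours differs

for dirpath, dirnames, filenames in os.walk(RRD_ROOT):
    rrds = [f for f in filenames if f.endswith(".rrd")]
    if not rrds:
        continue
    # dirpath is either <root>/__SummaryInfo__ (grid summary),
    # <root>/<cluster>/__SummaryInfo__ (cluster summary),
    # or <root>/<cluster>/<host> (per-host metrics)
    print("%s: %d metrics" % (os.path.relpath(dirpath, RRD_ROOT), len(rrds)))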
how do you get the data out of the round-robin databases? using rrdtool.
here is a walk-through on one of the Millennium monitoring machines...
# cd /var/lib/ganglia/rrds
# ls
CITRIS Pilot Cluster
Millennium Cluster
OceanStore
WebServer Cluster
__SummaryInfo__
# cd Citris\ Pilot\ Cluster/
# ls
grapefruit.Millennium.berkeley.edu
lime.Millennium.Berkeley.EDU
lemon.Millennium.berkeley.edu
orange.Millennium.berkeley.edu
__SummaryInfo__
# cd orange.Millennium.berkeley.edu
# ls
bytes_in.rrd    cpu_nice.rrd    disk_free.rrd     load_one.rrd     mem_shared.rrd     pkts_out.rrd
bytes_out.rrd   cpu_num.rrd     disk_total.rrd    mem_buffers.rrd  mem_total.rrd      proc_run.rrd
cpu_aidle.rrd   cpu_system.rrd  load_fifteen.rrd  mem_cached.rrd   part_max_used.rrd  proc_total.rrd
cpu_idle.rrd    cpu_user.rrd    load_five.rrd     mem_free.rrd     pkts_in.rrd        swap_free.rrd
say we want to monitor a specific metric (say cpu_user) on this specific
host (orange). to get the data, all we have to do is use rrdtool:
# date '+now is %s'
now is 1034101015
# rrdtool fetch ./cpu_user.rrd AVERAGE -s N-60
sum
1034100945: 0.0000000000e+00
1034100960: 3.7333333333e-01
1034100975: 7.0000000000e-01
1034100990: 7.0000000000e-01
1034101005: 7.0000000000e-01
1034101020: nan
the first command is just to let you see the timestamp of when i ran this.
the rrdtool command is simple and gives you a nice table of recent values
("N-60" means now minus 60 seconds, so the data covers the last 60 seconds).
the first column is the timestamp when the data was put into the database
and the second column (after the ':' delimiter) is the value that was
inserted. the Data Source (DS) name is "sum", which you see at the top.
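if you want to do this from a script instead of by hand, wrapping the fetch
is easy enough. here's a rough python sketch (the fetch_values name and the
hard-coded path are just for illustration)...

import subprocess

def fetch_values(rrd_path, seconds=60):
    """run 'rrdtool fetch' and return a list of (timestamp, [values]) rows."""
    out = subprocess.check_output(
        ["rrdtool", "fetch", rrd_path, "AVERAGE", "-s", "N-%d" % seconds])
    rows = []
    for line in out.decode().splitlines():
        if ":" not in line:
            continue                      # skip the DS-name header and blanks
        ts, rest = line.split(":", 1)
        if not ts.strip().isdigit():
            continue
        rows.append((int(ts), [float(v) for v in rest.split()]))
    return rows

# e.g. the last minute of cpu_user on orange
rows = fetch_values("/var/lib/ganglia/rrds/Citris Pilot Cluster/"
                    "orange.Millennium.berkeley.edu/cpu_user.rrd")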
important note:
the __SummaryInfo__ databases have 2 Data Sources, "sum" and "num". the
"num" data source is the number of hosts that were added together to get
the "sum". it lets you easily compute averages (just divide sum by num).
let's open up a __SummaryInfo__ database now...
# cd ..
# pwd
/var/lib/ganglia/rrds/Citris Pilot Cluster
# ls
grapefruit.Millennium.berkeley.edu  lime.Millennium.Berkeley.EDU    __SummaryInfo__
lemon.Millennium.berkeley.edu       orange.Millennium.berkeley.edu
# cd __SummaryInfo__
# date '+now is %s'
now is 1034101477
# rrdtool fetch ./cpu_user.rrd AVERAGE -s N-60
sum num
1034101410: 1.3000000000e+00 3.0000000000e+00
1034101425: 9.0000000000e-01 3.0000000000e+00
1034101440: 1.0000000000e-01 3.0000000000e+00
1034101455: 2.5000000000e+00 3.0000000000e+00
1034101470: 2.5000000000e+00 3.0000000000e+00
1034101485: nan nan
the command line for getting the data is exactly the same, but you get back
a second column which (in this case) tells us that the values from three
hosts were added together to get the "sum".
make sense?
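so a cluster-wide average is just one division away. a sketch reusing the
hypothetical fetch_values() helper from above...

# cluster-wide average cpu_user over the last minute
rows = fetch_values("/var/lib/ganglia/rrds/Citris Pilot Cluster/"
                    "__SummaryInfo__/cpu_user.rrd")
for ts, (total, nhosts) in rows:
    if nhosts > 0:                        # the trailing nan row fails this test
        print("%d: average cpu_user = %.2f" % (ts, total / nhosts))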
the nice thing about using the round-robin databases is that you have a
strict hierarchical directory structure which allows you to key in on the
specific things you want to monitor. it's also a good solution because
gmetad has done all the summary work for you.
you could write an alert system using simple scripting (bourne, perl,
python, et al).
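to make that concrete, a bare-bones threshold checker (run from cron, say)
could be as simple as the following python sketch. the metric names, limits
and notification hook here are completely made up, and it leans on the
fetch_values() helper sketched earlier...

import os

RRD_ROOT = "/var/lib/ganglia/rrds"
THRESHOLDS = {"cpu_user": 90.0, "load_one": 8.0}   # illustrative limits only

def latest_average(rrd_path):
    """latest non-nan (sum / num) value from a __SummaryInfo__ rrd."""
    for ts, (total, nhosts) in reversed(fetch_values(rrd_path)):
        if nhosts > 0:                             # skip nan rows
            return total / nhosts
    return None

for cluster in os.listdir(RRD_ROOT):
    summary = os.path.join(RRD_ROOT, cluster, "__SummaryInfo__")
    if not os.path.isdir(summary):
        continue
    for metric, limit in THRESHOLDS.items():
        rrd = os.path.join(summary, metric + ".rrd")
        if not os.path.exists(rrd):
            continue
        value = latest_average(rrd)
        if value is not None and value > limit:
            print("ALERT: %s %s = %.2f (limit %.2f)"
                  % (cluster, metric, value, limit))
            # hook in mail, paging, whatever here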
i would like someone to step up to the plate on this. it's a great feature
that is just dying to be born. just let me know if you (plural... meaning
Leif or someone on the development team... or both) want to take ownership
of this. i'll help all i can... we have a great group of developers too who
would be invaluable in building this.
btw, i'm making good progress building the hierarchical tree data
structure which will be the core of ganglia 3. it's basically going to be
a super-fast in-memory file system with read-write locking for
concurrency. it also has arbitrary depth so we can be creative with the
namespace.
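(to give a rough idea of the shape of it, here's a toy python sketch of the
concept only. the real thing is C, and it uses proper read-write locks where
this stand-in just uses a plain mutex per node...)

import threading

class Node(object):
    """one entry in an in-memory, filesystem-like tree."""
    def __init__(self, name, value=None):
        self.name = name
        self.value = value
        self.children = {}               # name -> Node, arbitrary depth
        self.lock = threading.Lock()     # real version: reader-writer lock

    def insert(self, path, value):
        node = self
        for part in path.strip("/").split("/"):
            with node.lock:
                node = node.children.setdefault(part, Node(part))
        with node.lock:
            node.value = value

    def lookup(self, path):
        node = self
        for part in path.strip("/").split("/"):
            with node.lock:
                node = node.children.get(part)
            if node is None:
                return None
        return node

root = Node("/")
root.insert("Cluster 1/orange/cpu_user", 0.7)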
-matt
Today, Leif Nixon wrote forth saying...
> Steven Wagner <[EMAIL PROTECTED]> writes:
>
> > Leif Nixon wrote:
> > > Steven Wagner <[EMAIL PROTECTED]> writes:
> > > Yes, that's what I did last week. It ain't no fun. Nagios' handling
> > > of passive service checks isn't flexible enough. And passive host
> > > checking Just Isn't Done.
> >
> > Once again, considering you have the source at your disposal, I'm sure
> > you could work something out. Spackling in passive host checking is
> > easier than some of the alternatives. :)
>
> I'm not sure about that. Cue Ethan Galstad, Nagios' creator:
>
> "I am investigating the possibility of adding passive host checks in
> 2.0. However, allowing passive checks opens a whole new can of worms
> as far as host check logic is concerned. For instance, if a host is
> reported (passively) as being down (it was previously up), what
> should happen with child hosts? Should those be actively checked
> according to the current tree traversal logic? Also, host checks are
> performed on-demand only (synchronously), so how do you handle
> asynchronous results? Host checks also get priority over pending
> passive service check results, so that has to be figured out.
>
> Anyway, it isn't exactly trivial without changing a good portion of
> how the host check logic works. I'll be looking into it though..."
>
> I don't think I want to dive that deep into Nagios just to make it do
> something it really isn't designed to do.
>
> > > Well, each metric could certainly come with default thresholds,
> > > and if you use some inheritance mechanism you could rather easily
> > > specify thresholds for all your cluster nodes:
> >
> > In a per-node model you have to distribute the new config file to n
> > nodes every time you change something. Which is kind of a bummer,
> > since (as I mentioned before) it seems that there's always an initial
> > tweaking period with notifying mechanisms where you're changing the
> > config every five minutes.
>
> I'm not sure I see the point in distributing the threshold information.
> As you said, the actual notifications will be issued from a central
> host, so why not just keep the threshold configuration there?
>
> > > That way, you only need to specify any exceptions from the defaults.
> > > Whooshy enough?
> >
> > The mental image I was actually going for was the loading program from
> > The Matrix, substituting endless streams of configuration directives
> > for racks o' firearms...
>
> Yes, obviously. So I showed how a few configuration lines (cut to
> Tank, rapidly typing) could specify load thresholds for an entire
> metacluster (WHOOSH). 8^)
>
> > *That* stuff needs to be in gmetad (or a program that fills the same
> > niche, querying one or more metadaemons or monitoring cores, chewing
> > on the XML data, and doing something with it). Flap thresholds,
> > contact info, etc., etc., etc. ...
> >
> > Sounds like you're volunteering to write it. :P
>
> Here I was, hoping I could inspire someone else. 8^)