On Tue, May 16, 2006 at 02:46:54PM -0400, Ed Ravin wrote:
> I need to automate the "kick something when it falls over" stage of
> system management. Mon is the way we detect that things have fallen
> over, but the host Mon runs on is not the host that has the privileges
> to kick things. So here's my question:
>
> Has anyone built a Mon client that can make decisions (or invoke scripts)
> based on the status of a particular service? I can cook something up
> if needed, but thought it would be wise to see what other people are doing.
> I think what I need is a client that will return a non-zero exit status
> if a particular watch/service is down for N seconds and not acked.
Didn't get any responses. I ended up updating the "monfailures" client
to give it a few new features to dump out the individual fields in
Mon's entry for the service, and to control listing based on the value
of a field. A bit primitive, but useful for when you need simple
information that would otherwise require putting the Mon API interface
into some other script. The new version of monfailures is attached.
I also improved the -include and -exclude features to work in more cases,
and to work for service names as well as watch names, and added a perldoc
man page.
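
For the original "down for N seconds and not acked" case, a calling script
could look something like the sketch below. This is only an illustration:
the wrapper, its argument layout and the "would kick" action are all
hypothetical, and it assumes monfailures is on the PATH and that
first_failure holds an epoch timestamp (which is how the script itself uses
it for the "Down Since" column). Since only one --testfield is allowed per
run, it asks twice.

```perl
#!/usr/bin/perl
# Hypothetical wrapper around monfailures. The argument layout and the
# "would kick" action are illustrative assumptions, not part of monfailures.
use strict;
use warnings;

# Run a command without a shell. Return its exit status, or -1 if it
# could not be started or was killed by a signal (e.g. the alarm timeout).
sub exit_status_of {
    my @cmd = @_;
    system(@cmd);
    return -1 if $? == -1 or ($? & 127);
    return $? >> 8;
}

# Build a monfailures command that considers exactly one watch:service
# and lists it only if the given "field op value" test passes.
sub monfailures_cmd {
    my ($target, $test) = @_;
    return ("monfailures",
            "--include",   "^" . quotemeta($target) . "\$",
            "--testfield", $test);
}

# monfailures exits 1 only when it displayed a failure; 0 means nothing
# matched, and anything else means monfailures itself had a problem.
sub is_failing {
    my ($status) = @_;
    return $status == 1 ? 1 : 0;
}

# Usage: wrapper watch:service min-seconds-down
if (@ARGV == 2) {
    my ($target, $min_down) = @ARGV;
    die "$0: min-seconds-down must be an integer\n"
        unless $min_down =~ /^\d+$/;
    my $cutoff = time - $min_down;

    my $down_long = exit_status_of(
        monfailures_cmd($target, "first_failure < $cutoff"));
    my $unacked = exit_status_of(
        monfailures_cmd($target, "ack == 0"));

    for my $status ($down_long, $unacked) {
        die "$0: monfailures failed (status $status)\n"
            unless $status == 0 or $status == 1;
    }

    if (is_failing($down_long) and is_failing($unacked)) {
        print "would kick $target\n";   # put the real restart command here
        exit 1;
    }
    exit 0;
}
```

Treating only an exit status of exactly 1 as "down" matters, because a
connection or login failure makes monfailures die with some other non-zero
status, and you probably don't want to kick anything in that case.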
-- Ed
#!/usr/local/bin/perl5.6.1 -w
# Quickly show Mon failure status from command line.
# to configure, hard-code the user and password for either
# your public Mon username or a username that is only allowed
# to use the "list" command and nothing else. I run this
# script out of inetd on the mon server so the people who can
# see its results can't read the script (and see the hard-coded
# password).
# use --exclude or --include (or set their default values in the
# script) to exclude or include only particular regexp matches of
# watches.
# other features (--fields, --testfield) for getting out more data
# or for testing for failed services via command line
# Written by Ed Ravin <[EMAIL PROTECTED]> Jan 2002.
# made available to the public by courtesy of PANIX (http://www.panix.com).
# This script is licensed under the GPL.
# Updated May 2006 with field control and other features
# $Header: /devel/build/NetBSD/mon/mon-1.1-devel/mon/clients/RCS/monfailures,v 1.8 2006/05/20 01:23:52 root Exp $
use strict;
my %opt;
use Getopt::Long;
my $usage="Usage: monfailures [--server host] [--port port] [--user user]
[--password pw] [--timeout n] [--include watch-regexp] [--exclude watch-regexp]
[--fields {ALL|f1,f2,...}] [--testfield 'fieldname op value']\n";
die $usage unless
GetOptions (\%opt, "debug", "testfield=s", "fields=s", "server=s", "port=s",
"timeout=i", "user=s", "password=s", "include=s", "exclude=s");
############################ configurable stuff - or put in defaults file
my $defaults_file= "/etc/mon/monfailures.cf";
my $default_user="public";
my $default_password= "readonly";
my $default_server= "localhost";
my $default_timeout= 120;
my $default_include= ".*";
my $default_exclude= "";
############################
my $debug= $opt{'debug'} || 0;
my @fields= ();
if (exists($opt{'fields'}))
{
@fields= split ',' , $opt{'fields'};
}
my $teststr= $opt{'testfield'} || "";
my ($testfield, $testop, $testval)= ("", "", "");
if (length($teststr))
{
($testfield, $testop, $testval)= split(' ', $teststr);
warn "testfield=$testfield, testop=$testop, testval=$testval\n" if
$debug;
die "$0: illegal characters in --testfield option\n"
if $testfield =~ /[`'"$ ]/;
die "$0: illegal fieldname in --testfield option\n"
unless $testfield =~ /^\w+$/;
die "$0: illegal test operator in --testfield option\n"
unless ($testop eq "+" or $testop eq "-" or $testop eq "=="
or $testop eq "!=" or $testop eq ">" or $testop eq "<");
die "$0: illegal integer value in --testfield option\n" unless
$testval =~ /^-?\d+$/;
}
my (%failures, %disabled);
my ($now);
use Mon::Client;
# format of defaults file:
# keyword = VALUE (no spaces allowed in VALUE)
# leading # sign for comments
# valid keywords: user, password, server, include, exclude, timeout
if (-f $defaults_file)
{
if ( open(DEF, "<$defaults_file"))
{
my @defaults= <DEF>;
close DEF;
foreach $_ (@defaults)
{
next if /^\s*#/;
next if /^$/;
$default_user= $1 if /^\s*user\s*=\s*(\S+)/;
$default_password= $1 if /^\s*password\s*=\s*(\S+)/;
$default_server= $1 if /^\s*server\s*=\s*(\S+)/;
$default_include= $1 if /^\s*include\s*=\s*(\S+)/;
$default_exclude= $1 if /^\s*exclude\s*=\s*(\S+)/;
$default_timeout= $1 if /^\s*timeout\s*=\s*(\d+)/;
}
}
else
{
warn "monfailures: cannot open defaults file $defaults_file: $!\n";
}
}
my $include_filter= $opt{'include'} || $default_include;
my $exclude_filter= $opt{'exclude'} || $default_exclude;
my $timeout= $opt{'timeout'} || $default_timeout;
my $mon;
# find the client
if (!defined ($mon = Mon::Client->new)) {
die "$0: could not create client object: $@";
}
$mon->host ($opt{'server'} || $default_server);
$mon->port ($opt{'port'}) if (defined $opt{'port'});
$mon->username($opt{'user'} ||
$ENV{'MONFAILURES_USER'} ||
$default_user);
$mon->password($opt{'password'} ||
$ENV{'MONFAILURES_PASSWORD'} ||
$default_password);
alarm($timeout); # die if we get stuck talking to Mon
$mon->connect;
die "$0: Could not connect to server: " . $mon->error . "\n"
unless $mon->connected;
$mon->login;
die "$0: login failure: " . $mon->error . "\n" if $mon->error;
# Load data from Mon
%disabled= $mon->list_disabled;
die "$0: Error doing list_disabled : " . $mon->error
if ($mon->error);
%failures = $mon->list_failures;
die "$0: Error doing list_failures : " . $mon->error
if ($mon->error);
$now= time; # time mon data was fetched
# group=thathost service=port8888 opstatus=0 last_opstatus=0 exitval=1 timer=11
# last_success=0 last_trap=0 last_check=955058065 ack=0 ackcomment=''
# alerts_sent=0 depstatus=0 depend='' monitor='tcp.monitor -p 8888'
# last_summary='thathost'
# last_detail='\0athathost could not connect: Connection refused\0a'
# last_failure=955058067 interval=60 first_failure=955055062
# failure_duration=3052
my ($watch, $service, $downtime, $summary, $acked);
format STDOUT_TOP =
Hostgroup:Service Down Since Error Summary
----------------- ---------- -------------
.
format STDOUT =
@<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<
$watch . ":" . $service, $downtime, $summary
.
# list out any failures
my $failures_shown= 0;
if (%failures)
{
foreach $watch (keys %failures) {
next if exists($disabled{"watches"}{$watch});
foreach $service (keys %{$failures{$watch}}) {
next if length($exclude_filter) and
"$watch:$service" =~ $exclude_filter;
next unless "$watch:$service" =~ $include_filter;
next if exists($disabled{"services"}{$watch}{$service});
my $sref= \%{$failures{$watch}->{$service}};
# It's on the include list, and it's down.
# Now we test an individual field if asked on command line
if (length($teststr)) {
warn "$0: testing $teststr\n" if $debug;
die "$0: field $testfield does not exist, aborting.\n"
unless exists($sref->{$testfield});
next unless eval "($sref->{$testfield} $testop $testval)";
}
# print out the summary failure info for the service
# or print specific field-based info as per command-line args
if (@fields == 0) {
$downtime= localtime $sref->{'first_failure'};
$acked= $sref->{'ack'} !=0;
$summary= $sref->{'last_summary'};
$summary= "[acked] $summary" if $acked;
write;
} else {
print "$watch:$service: ";
if (@fields == 1 && $fields[0] eq "ALL") {
foreach my $field (keys %{$sref}) {
print "$field=$sref->{$field}\t";
}
} else {
foreach my $field (@fields) {
print "$field=$sref->{$field}\t"
if exists($sref->{$field});
}
}
print "\n";
}
$failures_shown= 1;
}
}
if ($failures_shown)
{
print "\n";
exit(1);
}
}
print "No failures found.\n";
exit(0);
__END__
=head1 NAME
monfailures - display failed services in Mon
=head1 SYNOPSIS
B<monfailures> [--server I<host>] [--port I<port>] [--timeout I<seconds>]
[--user I<username>] [--password I<password>]
[--include I<watch-regexp>] [--exclude I<watch-regexp>]
[--fields { ALL | f1,f2,f3 [...] } ]
[--testfield "I<fieldname op value>" ]
=head1 DESCRIPTION
B<monfailures> queries a Mon server and displays a quick summary of
failed services. With the available options, you may display only
a subset of the services being monitored, exclude a subset of services,
display the individual fields in Mon's record for a service, and only
display a service if a particular field's value passes a numeric test.
B<monfailures> will attempt to read in a configuration file from
/etc/mon/monfailures.cf . Several of its options can be set there,
in the form:
=item B<keyword> = I<value>
Where B<keyword> is one of B<user>, B<password>, B<server>, B<timeout>,
B<include>, or B<exclude>. Blank lines or lines that begin with a # sign
are ignored. Options specified on the command line override any found
in the configuration file.
=head1 OPTIONS
=head2 Connecting to the Mon server
=item B<--user I<username>>
=item B<--password I<password>> Specify the username and/or password
to connect to the Mon server. The default values (public/readonly) are
hard-coded in the script. The username used by B<monfailures> needs only
permissions to the "list" command in Mon.
=item B<--server I<hostname>>
=item B<--port I<port-number>> Specify the hostname and port number
of the Mon server. The defaults are "localhost" and 2583.
=item B<--timeout I<seconds>> Abort if the transaction with the Mon
server takes longer than the specified number of seconds. The default
value is 120.
=head2 Filter options
=item B<--include I<watch-regexp>> Only list failed services whose
servicenames match Perl regular expression I<watch-regexp>. The watch
and servicename are concatenated together with a colon character, similar
to the way they are referenced in dependency clauses in the Mon configuration
file, and the combined watch:servicename are compared against I<watch-regexp>.
If the failed service matches, it is considered for display by B<monfailures>,
otherwise it is skipped.
=item B<--exclude I<watch-regexp>> Do not list failed services whose
servicenames match Perl regular expression I<watch-regexp>. This
option overrides any matches from the B<--include> option.
=head2 Field options
=item B<--fields> { ALL | I<f1>,I<f2>,I<f3>[, ...] } Instead of displaying
a quick summary of the failed service, display all the raw fields used
by Mon to track the service (the B<--fields ALL> option) or just the
raw fields specified in a comma-separated list.
=item B<--testfield> "I<fieldname operator value>" Only display the
failed service if field I<fieldname> matches the numeric test specified.
I<operator> must be one of the four relational operators: == (test for
equality), != (test for inequality), < (test if field is less than), or
> (test if field is greater than). I<value> must be an integer. If the
specified expression evaluates to true, B<monfailures> will display the
service. This feature is intended to allow scripts calling B<monfailures>
to make decisions (for example, to reboot a server after a service
has been down for longer than N seconds).
=head1 RETURN VALUE
B<monfailures> will return zero if no failures were found that matched
the requested criteria (and passed any specified test), or 1 if any
failures were displayed.
=head1 EXAMPLES
=item B<monfailures --include ^mailservers:smtp --testfield "exitval == 1">
Check Mon for failed services with the watch name "mailservers" and a
service name beginning with "smtp", and report on the failure if the Mon
field "exitval" has a value of 1.
=item B<monfailures --fields ALL>
Report on all failed services in raw format - this will let you see
the field names to choose for the B<--fields> or B<--testfield> options.
=head1 NOTES
Only one instance of any option may be specified.
Disabled services are not listed.
=head1 AUTHOR
B<monfailures> was written by Ed Ravin <[EMAIL PROTECTED]>, and has been
made available to the public under the GNU General Public License courtesy
of PANIX (http://panix.com).