I have been thinking about this for a long time, and I have tried several
ways of monitoring jobs, but none of the existing tools gave me the kind
of monitoring I wanted.
The things I want a backup monitor to do are:
* Alert if a backup job fails to start
* Alert if the job is waiting on media, or if anything happens other than
normal execution
* Alert if the job terminates with a status other than OK
The standard way to monitor seems to be to use passive alerts which are
submitted from the backup job, and then use freshness checking to make
sure the job runs when it is supposed to. The big problem with this
approach (as I see it) is this: if a backup is delayed or has to be
restarted, then the expiry of its 'freshness' will also be delayed, so
Nagios would be late in reporting a problem the next time round. For
example, if a weekly job due on Friday night does not run until Saturday,
its freshness window now expires a day late, so a complete failure to
start the following Friday would not be noticed until Saturday.
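(For reference, freshness checking here means something like the following
in the service definition -- a minimal sketch, with an arbitrary threshold:

    check_freshness         1
    freshness_threshold     619200  ; a weekly job, plus several hours' slack

The threshold is measured from the last check result received, not from the
job's schedule, which is the root of the problem.)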
Also, sending problem reports from a backup job is unreliable, since
problems with Bacula or the server might delay or prevent passive alerts.
Active services are not much use either, since plugins are stateless,
so unless a plugin maintains its own state files, it cannot tell the
difference between a job which has not started and a job which has
finished (OK or otherwise).
Having tried and failed with various techniques, I eventually came to the
conclusion that the best way to monitor backups is to run a script
independently of Bacula and use passive alerts from the script to report
the backup's progress.
So .... I got to work and wrote one. I have attached the script; I have
been using it since April 2010, and I think it's time to contribute it
to the community ....
A brief description:
I have services configured in Nagios of the form "Backup: <jobname>" (with
a space after the colon, to match the service name the script sends),
which are set up as passive services.
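A definition for such a service looks something like this (a minimal
sketch; the template, host and check_dummy command names are placeholders
for whatever your setup uses, and note that the script submits results
against the host name reported by uname -n):

    define service {
        use                      generic-service    ; placeholder template
        host_name                bacula-dir         ; host where the monitor runs
        service_description      Backup: Gershwin
        active_checks_enabled    0
        passive_checks_enabled   1
        check_command            check_dummy!3      ; required but never run
    }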
I run the monitor script from the nagios user's crontab, using entries
like these (the first runs at 21:30 every Friday, the second at 21:40
on Fridays and Saturdays):
30 21 * * 5 /usr/local/nagios/bin/bacula_monitor Gershwin
40 21 * * 5,6 /usr/local/nagios/bin/bacula_monitor -W Catalog
The script proceeds in three main stages:
1 - Wait for the job to start & get the jobid
2 - Monitor the progress of the jobid
3 - Report the termination status
At stage 1, Nagios is sent a warning if the job takes too long to start,
i.e. does not appear in the running jobs list. The warning turns into
a critical alert if the job takes long enough (the warning and critical
thresholds have defaults configured in the script, but can be overridden
on the command line, as can all the other thresholds).
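For example, to give a job more time to start before alerting (all the
thresholds are in seconds):

    /usr/local/nagios/bin/bacula_monitor -wd 600 -cd 1800 -J Gershwin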
At stage 2, the job is expected to appear in the list of running jobs with
a status which is one of a short list of "acceptable" status strings. If the
status is anything else, then Nagios is sent a warning or critical alert
once the corresponding time threshold has been exceeded.
Once the job disappears from the running jobs list, the monitor moves on to
stage 3, which simply reports the termination status of the job and exits.
The "acceptable" status strings are: "is running", "Dir inserting Attributes",
and "has terminated". If the -W flag was supplied on the command line, then
"is waiting execution" is accepted as long as there is at least one more job
in the running jobs list.
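For reference, the part of the "st dir" output the script parses looks
roughly like this (an illustration reconstructed from what the awk code
expects, not copied from a real director):

    Running Jobs:
     JobId Level   Name                                Status
    ======================================================================
      1234 Full    Gershwin.2011-01-28_21.30.00_05     is running
    ====

The job is matched by its name plus a dot at the start of the Name field,
and the status is everything after the first three fields.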
As I said above, I have been using this script for almost a year, and find
that it works very well. I hope it will be of use to others ....
I have also attached another script (bnu) which sends Nagios a passive alert
to update a service with the status of a job which has already terminated.
I sometimes use this script when I have had to restart a job manually
without running bacula_monitor again; if Nagios is still critical because
the original job failed, bnu will update the Nagios service.
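bnu identifies the job either by name or by jobid, e.g.

    bnu -j Gershwin
    bnu -i 1234

Given only a name (the -j flag, or just a bare jobname argument), it looks
up the most recent jobid for that job in the catalog.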
Allan
PS I wrote this on Solaris 10, so anyone trying it under Linux will have to
change the PFE variable from "pfexec" to "sudo" (or "" if the script will be
run with sufficient privs).
A
PPS I have signed the FLA.
A
#!/bin/ksh
#
# Bacula monitor for Nagios
# Written by Allan Black
# Last Modified: 2011-01-28
#
# Usage: bacula_monitor [options] <jobname>
#
# Description:
#
# This script examines the output of the Bacula console "st dir"
# command, looking in the "Running Jobs" section for the job named
# on the command line. It then monitors the status of the job and
# tracks its progress, using Nagios external commands to update a
# passive service whose name is based on the job name.
#
STARTINT=30		# seconds between checks while waiting for the job to start
MONINT=600		# seconds between checks while the job is running
WARNDELAY=300		# warn if the job has not started after this long
CRITDELAY=900		# critical if the job has not started after this long
WARNTIME=900		# warn after this long since the last normal status
CRITTIME=1800		# critical after this long in the same abnormal status
BACULADIR=/usr/local/bacula
BACULASBIN=${BACULADIR}/sbin
BACULAETC=${BACULADIR}/etc
BCCONFIG=${BACULAETC}/bconsole.conf
BCONSOLE="${BACULASBIN}/bconsole -c ${BCCONFIG}"
STATCMD="st dir"
STATE_OK=0
STATE_WARN=1
STATE_CRIT=2
STATE_UNKNOWN=3
PATH=/usr/sbin:/usr/bin
export PATH
#
# If necessary, set to pfexec (Solaris), sudo (Linux) etc.
#
PFE=pfexec
usage() {
    print "usage: $0 [args]"
    print "args:"
    print "  -W|--waiting        (allow waiting on other jobs)"
    print "  -m|--monint         (monitoring interval)"
    print "  -w|-wt|--warntime   (warning time threshold)"
    print "  -c|-ct|--crittime   (critical time threshold)"
    print "  -s|--startint       (startup monitoring interval)"
    print "  -wd|--warndelay     (warning startup time threshold)"
    print "  -cd|--critdelay     (critical startup time threshold)"
    print "  -J|--job JobName    (Bacula job to monitor)"
    print "  JobName             (Bacula job to monitor)"
}
while [[ $# -gt 0 ]]; do
    case "$1" in
    -H | --help)
        usage >&2
        exit 0
        ;;
    -W | --waiting)
        WAITING="yes"
        ;;
    -m | --monint)
        shift
        MONINT="$1"
        ;;
    -w | -wt | --warntime)
        shift
        WARNTIME="$1"
        ;;
    -c | -ct | --crittime)
        shift
        CRITTIME="$1"
        ;;
    -s | --startint)
        shift
        STARTINT="$1"
        ;;
    -wd | --warndelay)
        shift
        WARNDELAY="$1"
        ;;
    -cd | --critdelay)
        shift
        CRITDELAY="$1"
        ;;
    -J | --job)
        shift
        JOB="$1"
        ;;
    *)
        if [[ -n "$JOB" ]]; then
            usage >&2
            exit 1
        fi
        JOB="$1"
        ;;
    esac
    shift
done
if [[ -z "$JOB" ]]; then
    usage >&2
    exit 1
fi
director_status()
{
    print -R "$STATCMD" | ${PFE} ${BCONSOLE}
}
#
# Print the current time in seconds since the epoch. Solaris 10's
# date command has no %s format, so extract the return value of the
# time() system call from truss output instead.
#
itime() {
    case $(uname -s) in
    SunOS)
        truss date 2>&1 | awk '$1 == "time()" {print $NF}';;
    *)
        date '+%s';;
    esac
}
#
# Submit a passive check result by writing a PROCESS_SERVICE_CHECK_RESULT
# external command to the Nagios command pipe. The format is:
# [time] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<status>;<output>
#
nagios() {
    if [[ $# -gt 0 ]]; then
        NSTAT="$1"
        shift
    else
        NSTAT=${STATE_UNKNOWN}
    fi
    if [[ $# -gt 0 ]]; then
        NMSG="$*"
    else
        NMSG="Job status unknown"
    fi
    nargs="$(uname -n);Backup: ${JOB};${NSTAT};${NMSG}"
    print -R "[$(itime)] PROCESS_SERVICE_CHECK_RESULT;${nargs}" \
        > /var/nagios/rw/nagios.cmd
}
#
# Stage 1: wait for the job to start and get the Job ID. We sleep
# before the first status check to give the job time to start up.
#
STARTTIME=$(itime)
while [[ -z "$jobid" ]]; do
    delay=$(($(itime) - STARTTIME))
    if [[ "$delay" -gt "$CRITDELAY" ]]; then
        nagios ${STATE_CRIT} "Job ${JOB} not running"
    elif [[ "$delay" -gt "$WARNDELAY" ]]; then
        nagios ${STATE_WARN} "Job ${JOB} not running"
    fi
    sleep ${STARTINT}
    #
    # Look for the job in the Running Jobs: section. Job lines have
    # the form "JobId Level Name Status", where Name is the job name
    # with a dot and a timestamp appended.
    #
    jobid=$(director_status | awk '
        BEGIN {
            running = 0;
            rjlist = 0;
            jobid = "";
        }
        NF == 2 && $1 == "Running" && $2 == "Jobs:" {
            running = 1;
        }
        NF == 1 && $1 ~ /^=/ {
            # Lines of = characters bracket the job list
            if (rjlist) exit;
            if (running) rjlist = 1;
            next;
        }
        rjlist != 0 && $3 ~ /^'"$JOB"'\./ {
            jobid = $1;
        }
        END {
            print jobid;
        }')
done
#
# Stage 2: monitor the running job
#
STARTTIME=$(itime)
lgtime=${STARTTIME}		# time of the last normal ("good") status
lctime=${STARTTIME}		# time the current abnormal status first appeared
while :; do
    #
    # Capture every line of the Running Jobs: list. The whole list is
    # needed: we pick out our job by jobid below, and the -W logic
    # counts the number of running jobs.
    #
    runstat=$(director_status | awk '
        BEGIN {
            running = 0;
            rjlist = 0;
        }
        NF == 2 && $1 == "Running" && $2 == "Jobs:" {
            running = 1;
        }
        NF == 3 && $1 == "No" && $2 == "Jobs" \
            && $3 == "running." {
            exit;
        }
        NF == 1 && $1 ~ /^=/ {
            if (rjlist) exit;
            if (running) rjlist = 1;
            next;
        }
        rjlist != 0 {
            print;
        }')
    jobstat=$(print -R "$runstat" | awk '
        $1 == "'"$jobid"'" {
            print;
        }')
    if [[ -z "$jobstat" ]]; then
        #
        # Job has disappeared from the Running Jobs: list
        #
        break
    fi
    #
    # Get the current status: everything after the first three
    # fields (JobId, Level, Name) of the job line
    #
    timenow=$(itime)
    currstat=$(print -R "$jobstat" | sed \
        -e 's/^ *[^ ][^ ]* *[^ ][^ ]* *[^ ][^ ]* *//' \
        -e 's/ *$//')
    #
    # If we are allowed to accept a job waiting execution for a
    # while (e.g. because of job priority), we ignore this state
    # as long as there is at least one other job in the list
    #
    if [[ "$currstat" = "is waiting execution" && -n "$WAITING" ]]; then
        njobs=$(print -R "$runstat" | wc -l)
        if [[ "$njobs" -gt 1 ]]; then
            lgtime=${timenow}
            lctime=${timenow}
            nagios ${STATE_OK} "$jobstat"
            sleep ${MONINT}
            continue
        fi
    fi
    #
    # Anything other than a normal running state is warning/critical
    # if it stays like that for too long
    #
    case "$currstat" in
    "is running" \
    | "Dir inserting Attributes" \
    | "has terminated")
        nagios ${STATE_OK} "$jobstat"
        lgtime=${timenow}
        ;;
    *)
        ngtime=$((timenow - lgtime))
        if [[ "$currstat" != "$laststat" ]]; then
            # A different abnormal status restarts the critical timer
            lctime=${timenow}
        fi
        nctime=$((timenow - lctime))
        if [[ "$nctime" -gt "$CRITTIME" ]]; then
            nagios ${STATE_CRIT} "$jobstat"
        elif [[ "$ngtime" -gt "$WARNTIME" ]]; then
            nagios ${STATE_WARN} "$jobstat"
        fi
        ;;
    esac
    laststat=${currstat}
    sleep ${MONINT}
done
#
# Stage 3: the job is no longer running, so find it in the
# Terminated Jobs: section and report its termination status
#
jobstat=$(director_status | awk '
    BEGIN {
        terminated = 0;
        tjlist = 0;
    }
    NF == 2 && $1 == "Terminated" && $2 == "Jobs:" {
        terminated = 1;
    }
    NF == 1 && $1 ~ /^=/ {
        if (tjlist) exit;
        if (terminated) tjlist = 1;
        next;
    }
    tjlist != 0 && $1 == "'"$jobid"'" {
        print;
        exit;
    }')
#
# The Status column is assumed to be field 6: JobId, Level, Files,
# then Bytes (which prints as two fields, e.g. "5.123 G"), then Status
#
termstat=$(print -R "$jobstat" | awk '{print $6}')
if [[ "$termstat" != "OK" ]]; then
    if [[ -z "$jobstat" ]]; then
        jobstat="${JOB} failed"
    fi
    nagios ${STATE_CRIT} "Backup Error: $jobstat"
    exit 1
fi
nagios ${STATE_OK} "Backup OK: $jobstat"
exit 0
#!/bin/ksh
#
# bnu - Bacula Nagios Update
#
# Send Nagios a passive check result with the status of a Bacula job
# which has already terminated, identified by jobid or by job name.
#
DATABASE=bacula
USER=root
BACULADIR=/usr/local/bacula
BACULASBIN=${BACULADIR}/sbin
BACULAETC=${BACULADIR}/etc
BCCONFIG=${BACULAETC}/bconsole.conf
BCONSOLE="${BACULASBIN}/bconsole -c ${BCCONFIG}"
STATCMD="st dir"
SELID='JobId from Job'
SELNAME='Name from Job'
SELSTAT='JobId, Level, JobFiles, JobBytes, JobStatus, EndTime, Name from Job'
ORDER='EndTime desc'
MYSQL="/usr/local/mysql/bin/mysql"
myargs="--skip-column-names --database=${DATABASE} --user=${USER}"
mycmd="${MYSQL} ${myargs}"
#
# If necessary, set to pfexec (Solaris), sudo (Linux) etc.
#
PFE=pfexec
STATE_OK=0
STATE_WARN=1
STATE_CRIT=2
STATE_UNKNOWN=3
usage()
{
    echo "usage: $0 [-i jobid] [-j jobname] [jobname]" >&2
    exit 2
}
# Epoch time, using the same truss trick as bacula_monitor on Solaris
itime() {
    case $(uname -s) in
    SunOS)
        truss date 2>&1 | awk '$1 == "time()" {print $NF}';;
    *)
        date '+%s';;
    esac
}
nagios() {
    if [[ $# -gt 0 ]]; then
        JOB="$1"
        shift
    else
        return 1
    fi
    if [[ $# -gt 0 ]]; then
        NSTAT="$1"
        shift
    else
        NSTAT=${STATE_UNKNOWN}
    fi
    if [[ $# -gt 0 ]]; then
        NMSG="$*"
    else
        NMSG="Job status unknown"
    fi
    nargs="$(uname -n);Backup: ${JOB};${NSTAT};${NMSG}"
    print -R "[$(itime)] PROCESS_SERVICE_CHECK_RESULT;${nargs}" \
        > /var/nagios/rw/nagios.cmd
}
while getopts i:j: arg; do
    case "$arg" in
    i)
        [[ -n "$jobid" || -n "$jobname" ]] && usage
        jobid="$OPTARG"
        ;;
    j)
        [[ -n "$jobid" || -n "$jobname" ]] && usage
        jobname="$OPTARG"
        ;;
    \?)
        usage
        ;;
    esac
done
shift $((OPTIND - 1))
if [[ $# -gt 0 ]]; then
    if [[ $# -eq 1 ]]; then
        [[ -n "$jobid" || -n "$jobname" ]] && usage
        jobname="$1"
    else
        usage
    fi
fi
[[ -z "$jobid" && -z "$jobname" ]] && usage
if [[ -z "$jobid" ]]; then
    #
    # Given only a job name, find the most recent jobid for it
    #
    where='Name = "'"$jobname"'"'
    query="select ${SELID} where ${where} order by ${ORDER};"
    jobid=$(print -R "$query" | ${mycmd} | head -1)
    if [[ -z "$jobid" ]]; then
        echo "$0: cannot get latest jobid for "\""$jobname"\" >&2
        exit 3
    fi
fi
if [[ -z "$jobname" ]]; then
    where='JobId = '"$jobid"
    query="select ${SELNAME} where ${where};"
    jobname=$(print -R "$query" | ${mycmd})
    if [[ -z "$jobname" ]]; then
        echo "$0: cannot get job name for jobid "\""$jobid"\" >&2
        exit 3
    fi
fi
#
# Look for the job in the Terminated Jobs: section of "st dir" output
#
jobstat=$(print -R "$STATCMD" | ${PFE} ${BCONSOLE} | awk '
    BEGIN {
        terminated = 0;
        tjlist = 0;
    }
    NF == 2 && $1 == "Terminated" && $2 == "Jobs:" {
        terminated = 1;
    }
    NF == 1 && $1 ~ /^=/ {
        if (tjlist) exit;
        if (terminated) tjlist = 1;
        next;
    }
    tjlist != 0 && $1 == "'"$jobid"'" {
        print;
        exit;
    }' | tail -1)
if [[ -n "$jobstat" ]]; then
    # Status is field 6 of the console output (Bytes prints as two fields)
    termstat=$(print -R "$jobstat" | awk '{print $6}')
    termok="OK"
else
    #
    # The job has aged out of the console's list; fall back to the
    # catalog, where JobStatus is field 5 of our query and "T" is
    # Bacula's status code for a normal termination
    #
    where='JobId = '"$jobid"
    query="select ${SELSTAT} where ${where};"
    jobstat=$(print -R "$query" | ${mycmd} | head -1)
    termstat=$(print -R "$jobstat" | awk '{print $5}')
    termok="T"
fi
if [[ "$termstat" = "$termok" ]]; then
    nagios "$jobname" ${STATE_OK} "Backup OK: $jobstat"
else
    nagios "$jobname" ${STATE_CRIT} "Backup Error: $jobstat"
fi
exit 0