I have been thinking about this for a long time, and I have tried several
ways of monitoring jobs, but none of the existing tools gave me the kind
of monitoring I wanted.
The things I want a backup monitor to do are:
* Alert if a backup job fails to start
* Alert if the job is waiting on media, or if anything happens other than
normal execution
* Alert if the job terminates with a status other than OK
The standard way to monitor seems to be to use passive alerts which are
submitted from the backup job, and then use freshness checking to make
sure the job runs when it is supposed to. The big problem with this
approach (as I see it) is this: if a backup is delayed or has to be
restarted, then the expiry of its 'freshness' will also be delayed, so
Nagios would be late in reporting a problem the next time round. For
example, if a weekly job due on Friday night does not run until Saturday,
its freshness window now expires a day late, so a complete failure to
start the following Friday would not be noticed until Saturday.
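(For reference, freshness checking here means something like the following
in the service definition -- a minimal sketch, with an arbitrary threshold:

    check_freshness         1
    freshness_threshold     619200  ; a weekly job, plus several hours' slack

The threshold is measured from the last check result received, not from the
job's schedule, which is the root of the problem.)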
Also, sending problem reports from a backup job is unreliable, since
problems with Bacula or the server might delay or prevent passive alerts.
Active services are not much use either, since plugins are stateless,
so unless a plugin maintains its own state files, it cannot tell the
difference between a job which has not started and a job which has
finished (OK or otherwise).
Having tried and failed with various techniques, I eventually came to the
conclusion that the best way to monitor backups is to run a script
independently of Bacula and use passive alerts from the script to report
the backup's progress.
So .... I got to work and wrote one. I have attached the script; I have
been using it since April 2010, and I think it's time to contribute it
to the community ....
A brief description:
I have services configured in Nagios of the form "Backup: <jobname>" (with
a space after the colon, to match the service name the script sends),
which are set up as passive services.
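A definition for such a service looks something like this (a minimal
sketch; the template, host and check_dummy command names are placeholders
for whatever your setup uses, and note that the script submits results
against the host name reported by uname -n):

    define service {
        use                      generic-service    ; placeholder template
        host_name                bacula-dir         ; host where the monitor runs
        service_description      Backup: Gershwin
        active_checks_enabled    0
        passive_checks_enabled   1
        check_command            check_dummy!3      ; required but never run
    }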
I run the monitor script from the nagios user's crontab, using entries
like these (the first runs at 21:30 every Friday, the second at 21:40
on Fridays and Saturdays):
30 21 * * 5 /usr/local/nagios/bin/bacula_monitor Gershwin
40 21 * * 5,6 /usr/local/nagios/bin/bacula_monitor -W Catalog
The script proceeds in three main stages:
1 - Wait for the job to start & get the jobid
2 - Monitor the progress of the jobid
3 - Report the termination status
At stage 1, Nagios is sent a warning if the job takes too long to start,
i.e. does not appear in the running jobs list. The warning turns into
a critical alert if the job takes long enough (the warning and critical
thresholds have defaults configured in the script, but can be overridden
on the command line, as can all the other thresholds).
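For example, to give a job more time to start before alerting (all the
thresholds are in seconds):

    /usr/local/nagios/bin/bacula_monitor -wd 600 -cd 1800 -J Gershwin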
At stage 2, the job is expected to appear in the list of running jobs with
a status which is one of a short list of "acceptable" status strings. If the
status is anything else, then Nagios is sent a warning or critical alert
once the corresponding time threshold has been exceeded.
Once the job disappears from the running jobs list, the monitor moves on to
stage 3, which simply reports the termination status of the job and exits.
The "acceptable" status strings are: "is running", "Dir inserting Attributes",
and "has terminated". If the -W flag was supplied on the command line, then
"is waiting execution" is accepted as long as there is at least one more job
in the running jobs list.
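For reference, the part of the "st dir" output the script parses looks
roughly like this (an illustration reconstructed from what the awk code
expects, not copied from a real director):

    Running Jobs:
     JobId Level   Name                                Status
    ======================================================================
      1234 Full    Gershwin.2011-01-28_21.30.00_05     is running
    ====

The job is matched by its name plus a dot at the start of the Name field,
and the status is everything after the first three fields.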
As I said above, I have been using this script for almost a year, and find
that it works very well. I hope it will be of use to others ....
I have also attached another script (bnu) which sends Nagios a passive alert
to update a service with the status of a job which has already terminated.
I sometimes use this script when I have had to restart a job manually
without running bacula_monitor again; if Nagios is still critical because
the original job failed, bnu will update the Nagios service.
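bnu identifies the job either by name or by jobid, e.g.

    bnu -j Gershwin
    bnu -i 1234

Given only a name (the -j flag, or just a bare jobname argument), it looks
up the most recent jobid for that job in the catalog.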
Allan
PS I wrote this on Solaris 10, so anyone trying it under Linux will have to
change the PFE variable from "pfexec" to "sudo" (or "" if the script will be
run with sufficient privs).
A
PPS I have signed the FLA.
A
#!/bin/ksh
#
# Bacula monitor for Nagios
# Written by Allan Black
# Last Modified: 2011-01-28
#
# Usage: bacula_monitor [options] <jobname>
#
# Description:
#
# This script examines the output of the Bacula console "st dir"
# command, looking in the "Running Jobs" section for the job named
# on the command line. It then monitors the status of the job and
# tracks its progress, using Nagios external commands to update a
# passive service whose name is based on the job name.
#
STARTINT=30		# seconds between checks while waiting for the job to start
MONINT=600		# seconds between checks while the job is running
WARNDELAY=300		# warn if the job has not started after this long
CRITDELAY=900		# critical if the job has not started after this long
WARNTIME=900		# warn after this long since the last normal status
CRITTIME=1800		# critical after this long in the same abnormal status
BACULADIR=/usr/local/bacula
BACULASBIN=${BACULADIR}/sbin
BACULAETC=${BACULADIR}/etc
BCCONFIG=${BACULAETC}/bconsole.conf
BCONSOLE="${BACULASBIN}/bconsole -c ${BCCONFIG}"
STATCMD="st dir"
STATE_OK=0
STATE_WARN=1
STATE_CRIT=2
STATE_UNKNOWN=3
PATH=/usr/sbin:/usr/bin
export PATH
#
# If necessary, set to pfexec (Solaris), sudo (Linux) etc.
#
PFE=pfexec
usage() {
    print "usage: $0 [args]"
    print "args:"
    print "  -W|--waiting        (allow waiting on other jobs)"
    print "  -m|--monint         (monitoring interval)"
    print "  -w|-wt|--warntime   (warning time threshold)"
    print "  -c|-ct|--crittime   (critical time threshold)"
    print "  -s|--startint       (startup monitoring interval)"
    print "  -wd|--warndelay     (warning startup time threshold)"
    print "  -cd|--critdelay     (critical startup time threshold)"
    print "  -J|--job JobName    (Bacula job to monitor)"
    print "  JobName             (Bacula job to monitor)"
}
while [[ $# -gt 0 ]]; do
    case "$1" in
    -H | --help)
        usage >&2
        exit 0
        ;;
    -W | --waiting)
        WAITING="yes"
        ;;
    -m | --monint)
        shift
        MONINT="$1"
        ;;
    -w | -wt | --warntime)
        shift
        WARNTIME="$1"
        ;;
    -c | -ct | --crittime)
        shift
        CRITTIME="$1"
        ;;
    -s | --startint)
        shift
        STARTINT="$1"
        ;;
    -wd | --warndelay)
        shift
        WARNDELAY="$1"
        ;;
    -cd | --critdelay)
        shift
        CRITDELAY="$1"
        ;;
    -J | --job)
        shift
        JOB="$1"
        ;;
    *)
        if [[ -n "$JOB" ]]; then
            usage >&2
            exit 1
        fi
        JOB="$1"
        ;;
    esac
    shift
done
if [[ -z "$JOB" ]]; then
    usage >&2
    exit 1
fi
director_status()
{
    print -R "$STATCMD" | ${PFE} ${BCONSOLE}
}
#
# Print the current time in seconds since the epoch. Solaris 10's
# date command has no %s format, so extract the return value of the
# time() system call from truss output instead.
#
itime() {
    case $(uname -s) in
    SunOS)
        truss date 2>&1 | awk '$1 == "time()" {print $NF}';;
    *)
        date '+%s';;
    esac
}
#
# Submit a passive check result by writing a PROCESS_SERVICE_CHECK_RESULT
# external command to the Nagios command pipe. The format is:
# [time] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<status>;<output>
#
nagios() {
    if [[ $# -gt 0 ]]; then
        NSTAT="$1"
        shift
    else
        NSTAT=${STATE_UNKNOWN}
    fi
    if [[ $# -gt 0 ]]; then
        NMSG="$*"
    else
        NMSG="Job status unknown"
    fi
    nargs="$(uname -n);Backup: ${JOB};${NSTAT};${NMSG}"
    print -R "[$(itime)] PROCESS_SERVICE_CHECK_RESULT;${nargs}" \
        > /var/nagios/rw/nagios.cmd
}
#
# Stage 1: wait for the job to start and get the Job ID. We sleep
# before the first status check to give the job time to start up.
#
STARTTIME=$(itime)
while [[ -z "$jobid" ]]; do
    delay=$(($(itime) - STARTTIME))
    if [[ "$delay" -gt "$CRITDELAY" ]]; then
        nagios ${STATE_CRIT} "Job ${JOB} not running"
    elif [[ "$delay" -gt "$WARNDELAY" ]]; then
        nagios ${STATE_WARN} "Job ${JOB} not running"
    fi
    sleep ${STARTINT}
    #
    # Look for the job in the Running Jobs: section. Job lines have
    # the form "JobId Level Name Status", where Name is the job name
    # with a dot and a timestamp appended.
    #
    jobid=$(director_status | awk '
        BEGIN {
            running = 0;
            rjlist = 0;
            jobid = "";
        }
        NF == 2 && $1 == "Running" && $2 == "Jobs:" {
            running = 1;
        }
        NF == 1 && $1 ~ /^=/ {
            # Lines of = characters bracket the job list
            if (rjlist) exit;
            if (running) rjlist = 1;
            next;
        }
        rjlist != 0 && $3 ~ /^'"$JOB"'\./ {
            jobid = $1;
        }
        END {
            print jobid;
        }')
done
#
# Stage 2: monitor the running job
#
STARTTIME=$(itime)
lgtime=${STARTTIME}		# time of the last normal ("good") status
lctime=${STARTTIME}		# time the current abnormal status first appeared
while :; do
    #
    # Capture every line of the Running Jobs: list. The whole list is
    # needed: we pick out our job by jobid below, and the -W logic
    # counts the number of running jobs.
    #
    runstat=$(director_status | awk '
        BEGIN {
            running = 0;
            rjlist = 0;
        }
        NF == 2 && $1 == "Running" && $2 == "Jobs:" {
            running = 1;
        }
        NF == 3 && $1 == "No" && $2 == "Jobs" \
            && $3 == "running." {
            exit;
        }
        NF == 1 && $1 ~ /^=/ {
            if (rjlist) exit;
            if (running) rjlist = 1;
            next;
        }
        rjlist != 0 {
            print;
        }')
    jobstat=$(print -R "$runstat" | awk '
        $1 == "'"$jobid"'" {
            print;
        }')
    if [[ -z "$jobstat" ]]; then
        #
        # Job has disappeared from the Running Jobs: list
        #
        break
    fi
    #
    # Get the current status: everything after the first three
    # fields (JobId, Level, Name) of the job line
    #
    timenow=$(itime)
    currstat=$(print -R "$jobstat" | sed \
        -e 's/^ *[^ ][^ ]* *[^ ][^ ]* *[^ ][^ ]* *//' \
        -e 's/ *$//')
    #
    # If we are allowed to accept a job waiting execution for a
    # while (e.g. because of job priority), we ignore this state
    # as long as there is at least one other job in the list
    #
    if [[ "$currstat" = "is waiting execution" && -n "$WAITING" ]]; then
        njobs=$(print -R "$runstat" | wc -l)
        if [[ "$njobs" -gt 1 ]]; then
            lgtime=${timenow}
            lctime=${timenow}
            nagios ${STATE_OK} "$jobstat"
            sleep ${MONINT}
            continue
        fi
    fi
    #
    # Anything other than a normal running state is warning/critical
    # if it stays like that for too long
    #
    case "$currstat" in
    "is running" \
    | "Dir inserting Attributes" \
    | "has terminated")
        nagios ${STATE_OK} "$jobstat"
        lgtime=${timenow}
        ;;
    *)
        ngtime=$((timenow - lgtime))
        if [[ "$currstat" != "$laststat" ]]; then
            # A different abnormal status restarts the critical timer
            lctime=${timenow}
        fi
        nctime=$((timenow - lctime))
        if [[ "$nctime" -gt "$CRITTIME" ]]; then
            nagios ${STATE_CRIT} "$jobstat"
        elif [[ "$ngtime" -gt "$WARNTIME" ]]; then
            nagios ${STATE_WARN} "$jobstat"
        fi
        ;;
    esac
    laststat=${currstat}
    sleep ${MONINT}
done
#
# Stage 3: the job is no longer running, so find it in the
# Terminated Jobs: section and report its termination status
#
jobstat=$(director_status | awk '
    BEGIN {
        terminated = 0;
        tjlist = 0;
    }
    NF == 2 && $1 == "Terminated" && $2 == "Jobs:" {
        terminated = 1;
    }
    NF == 1 && $1 ~ /^=/ {
        if (tjlist) exit;
        if (terminated) tjlist = 1;
        next;
    }
    tjlist != 0 && $1 == "'"$jobid"'" {
        print;
        exit;
    }')
#
# The Status column is assumed to be field 6: JobId, Level, Files,
# then Bytes (which prints as two fields, e.g. "5.123 G"), then Status
#
termstat=$(print -R "$jobstat" | awk '{print $6}')
if [[ "$termstat" != "OK" ]]; then
    if [[ -z "$jobstat" ]]; then
        jobstat="${JOB} failed"
    fi
    nagios ${STATE_CRIT} "Backup Error: $jobstat"
    exit 1
fi
nagios ${STATE_OK} "Backup OK: $jobstat"
exit 0
#!/bin/ksh
#
# bnu - Bacula Nagios Update
#
# Send Nagios a passive check result with the status of a Bacula job
# which has already terminated, identified by jobid or by job name.
#
DATABASE=bacula
USER=root
BACULADIR=/usr/local/bacula
BACULASBIN=${BACULADIR}/sbin
BACULAETC=${BACULADIR}/etc
BCCONFIG=${BACULAETC}/bconsole.conf
BCONSOLE="${BACULASBIN}/bconsole -c ${BCCONFIG}"
STATCMD="st dir"
SELID='JobId from Job'
SELNAME='Name from Job'
SELSTAT='JobId, Level, JobFiles, JobBytes, JobStatus, EndTime, Name from Job'
ORDER='EndTime desc'
MYSQL="/usr/local/mysql/bin/mysql"
myargs="--skip-column-names --database=${DATABASE} --user=${USER}"
mycmd="${MYSQL} ${myargs}"
#
# If necessary, set to pfexec (Solaris), sudo (Linux) etc.
#
PFE=pfexec
STATE_OK=0
STATE_WARN=1
STATE_CRIT=2
STATE_UNKNOWN=3
usage()
{
    echo "usage: $0 [-i jobid] [-j jobname] [jobname]" >&2
    exit 2
}
# Epoch time, using the same truss trick as bacula_monitor on Solaris
itime() {
    case $(uname -s) in
    SunOS)
        truss date 2>&1 | awk '$1 == "time()" {print $NF}';;
    *)
        date '+%s';;
    esac
}
nagios() {
    if [[ $# -gt 0 ]]; then
        JOB="$1"
        shift
    else
        return 1
    fi
    if [[ $# -gt 0 ]]; then
        NSTAT="$1"
        shift
    else
        NSTAT=${STATE_UNKNOWN}
    fi
    if [[ $# -gt 0 ]]; then
        NMSG="$*"
    else
        NMSG="Job status unknown"
    fi
    nargs="$(uname -n);Backup: ${JOB};${NSTAT};${NMSG}"
    print -R "[$(itime)] PROCESS_SERVICE_CHECK_RESULT;${nargs}" \
        > /var/nagios/rw/nagios.cmd
}
while getopts i:j: arg; do
    case "$arg" in
    i)
        [[ -n "$jobid" || -n "$jobname" ]] && usage
        jobid="$OPTARG"
        ;;
    j)
        [[ -n "$jobid" || -n "$jobname" ]] && usage
        jobname="$OPTARG"
        ;;
    \?)
        usage
        ;;
    esac
done
shift $((OPTIND - 1))
if [[ $# -gt 0 ]]; then
    if [[ $# -eq 1 ]]; then
        [[ -n "$jobid" || -n "$jobname" ]] && usage
        jobname="$1"
    else
        usage
    fi
fi
[[ -z "$jobid" && -z "$jobname" ]] && usage
if [[ -z "$jobid" ]]; then
    #
    # Given only a job name, find the most recent jobid for it
    #
    where='Name = "'"$jobname"'"'
    query="select ${SELID} where ${where} order by ${ORDER};"
    jobid=$(print -R "$query" | ${mycmd} | head -1)
    if [[ -z "$jobid" ]]; then
        echo "$0: cannot get latest jobid for "\""$jobname"\" >&2
        exit 3
    fi
fi
if [[ -z "$jobname" ]]; then
    where='JobId = '"$jobid"
    query="select ${SELNAME} where ${where};"
    jobname=$(print -R "$query" | ${mycmd})
    if [[ -z "$jobname" ]]; then
        echo "$0: cannot get job name for jobid "\""$jobid"\" >&2
        exit 3
    fi
fi
#
# Look for the job in the Terminated Jobs: section of "st dir" output
#
jobstat=$(print -R "$STATCMD" | ${PFE} ${BCONSOLE} | awk '
    BEGIN {
        terminated = 0;
        tjlist = 0;
    }
    NF == 2 && $1 == "Terminated" && $2 == "Jobs:" {
        terminated = 1;
    }
    NF == 1 && $1 ~ /^=/ {
        if (tjlist) exit;
        if (terminated) tjlist = 1;
        next;
    }
    tjlist != 0 && $1 == "'"$jobid"'" {
        print;
        exit;
    }' | tail -1)
if [[ -n "$jobstat" ]]; then
    # Status is field 6 of the console output (Bytes prints as two fields)
    termstat=$(print -R "$jobstat" | awk '{print $6}')
    termok="OK"
else
    #
    # The job has aged out of the console's list; fall back to the
    # catalog, where JobStatus is field 5 of our query and "T" is
    # Bacula's status code for a normal termination
    #
    where='JobId = '"$jobid"
    query="select ${SELSTAT} where ${where};"
    jobstat=$(print -R "$query" | ${mycmd} | head -1)
    termstat=$(print -R "$jobstat" | awk '{print $5}')
    termok="T"
fi
if [[ "$termstat" = "$termok" ]]; then
    nagios "$jobname" ${STATE_OK} "Backup OK: $jobstat"
else
    nagios "$jobname" ${STATE_CRIT} "Backup Error: $jobstat"
fi
exit 0