On 2/6/2011 4:31 PM, Allan Black wrote:
> I have been thinking about this for a long time, and I have tried several
> ways of monitoring jobs, but none of the existing tools gave me the kind
> of monitoring I wanted.
>
> The things I want a backup monitor to do are:
>
> * Alert if a backup job fails to start
>
> * Alert if the job is waiting on media, or if anything happens other than
>   normal execution
>
> * Alert if the job terminates with a status other than OK
>
> The standard way to monitor seems to be to use passive alerts which are
> submitted from the backup job, and then use freshness checking to make
> sure the job runs when it is supposed to. The big problem with this
> approach (as I see it) is this: if a backup is delayed or had to be
> restarted, then the expiry of its 'freshness' will also be delayed, so
> Nagios would be late in reporting a problem next time.
>
> Also, sending problem reports from a backup job is unreliable, since
> problems with Bacula or the server might delay or prevent passive alerts.
>
> Active services are not much use either, since plugins are stateless,
> so unless a plugin maintains its own state files, it cannot tell the
> difference between a job which has not started and a job which has
> finished (OK or otherwise).
>
> Having tried and failed with various techniques, I eventually came to the
> conclusion that the best way to monitor backups is to run a script
> independently of Bacula and use passive alerts from the script to report
> the backup's progress.
>
> So .... I got to work and wrote it. This script, which I have attached,
> I have been using since April 2010 and I think it's time to contribute
> to the community ....
>
> A brief description:
>
> I have services configured in Nagios of the form "Backup:<jobname>",
> which are set up as passive alerts.
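
For anyone following along who has not submitted passive results by hand: Nagios accepts them as PROCESS_SERVICE_CHECK_RESULT lines written to its external command file. The sketch below is not taken from the attached script; it only shows what such a line could look like for a "Backup:<jobname>" service. The host name "backuphost", the function names, and the command file path are my assumptions (the path is the usual default for a source install under /usr/local/nagios and may differ on your system).

```python
import time

# Assumed location of the Nagios external command file; check the
# command_file setting in nagios.cfg on your own installation.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def passive_result(host, job, code, output, now=None):
    """Format one passive service check result for a Backup:<job> service.

    code follows the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    ts = int(time.time() if now is None else now)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};Backup:{job};{code};{output}\n"


def submit(line, command_file=COMMAND_FILE):
    """Write the formatted line to the command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # "backuphost" is a hypothetical Nagios host name.
    print(passive_result("backuphost", "Gershwin", 2, "Job failed to start"), end="")
```

The service description in the line has to match the one defined in Nagios exactly, otherwise the result is ignored.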
>
> I run the monitor script from the nagios user's crontab, using entries
> like this:
>
> 30 21 * * 5 /usr/local/nagios/bin/bacula_monitor Gershwin
> 40 21 * * 5,6 /usr/local/nagios/bin/bacula_monitor -W Catalog

One entry per job?

> The script proceeds in three main stages:
>
> 1 - Wait for the job to start & get the jobid
> 2 - Monitor the progress of the jobid
> 3 - Report the termination status
>
> At stage 1, Nagios will be sent a warning if the job takes too long to
> start, i.e. doesn't appear in the running jobs list. This will turn into
> a critical alert if it takes long enough (the warning and critical
> thresholds are configured in the script as defaults, but can be over-ridden
> on the command line, as can all the other thresholds).
>
> At stage 2, the job is expected to appear in the list of running jobs with
> a status which is one of a short list of "acceptable" status strings. If
> the status is anything else, then Nagios will be sent a warning or critical
> alert after given time thresholds.
>
> Once the job disappears from the running jobs list, the monitor moves on to
> stage 3, which simply reports the termination status of the job and exits.
>
> The "acceptable" status strings are: "is running", "Dir inserting
> Attributes", and "has terminated". If the -W flag was supplied on the
> command line, then "is waiting execution" is accepted as long as there is
> at least one more job in the running jobs list.
>
> As I said above, I have been using this script for almost a year, and find
> that it works very well. I hope it will be of use to others ....
>
> I have also attached another script (bnu) which sends Nagios a passive
> alert to update a service with the status of a job which has already
> terminated. I use this script sometimes if I have to restart a job
> manually, but didn't run bacula_monitor again. If Nagios is still critical
> because the original job failed, bnu will update the Nagios service.

How do I use bnu?

-- 
Dan Langille - http://langille.org/
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
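
To make the stage 2 rules quoted above concrete, here is a rough sketch of how the "acceptable status" test and the warning/critical escalation could be expressed. It is not the attached bacula_monitor script, just my reading of the description. The function names, the other_jobs parameter, the threshold values in the example, and the choice to treat thresholds as inclusive are all assumptions.

```python
# Nagios plugin-style return codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Status strings the description lists as always acceptable.
ACCEPTABLE = {"is running", "Dir inserting Attributes", "has terminated"}


def is_acceptable(status, allow_waiting=False, other_jobs=0):
    """Return True if a running-job status counts as normal execution.

    allow_waiting models the -W flag: "is waiting execution" is accepted
    only while at least one other job is in the running jobs list.
    """
    if status in ACCEPTABLE:
        return True
    return allow_waiting and status == "is waiting execution" and other_jobs >= 1


def severity(seconds_abnormal, warn_after, crit_after):
    """Map time spent in an abnormal state to OK, WARNING or CRITICAL."""
    if seconds_abnormal >= crit_after:
        return CRITICAL
    if seconds_abnormal >= warn_after:
        return WARNING
    return OK


if __name__ == "__main__":
    # Hypothetical thresholds: warn after 5 minutes, critical after 15.
    print(is_acceptable("is waiting execution", allow_waiting=True, other_jobs=1))
    print(severity(600, warn_after=300, crit_after=900))
```

The same severity function would also cover stage 1, with the clock measuring how long the job has been missing from the running jobs list.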