Currently all of my machines, except for one client, are running Bacula
5.0.2.  (One Windows 2000 machine still has a 3.0.1 client.)  I have two
Storage daemons, one on my main server on its disk array (babylon4-sd),
one on this machine on an LTO2 drive (babylon5-sd).  The Director and
catalog are on babylon4.  Monthly Full backups run to the LTO2 drive on
babylon5-sd.  Weekly differentials and nightly incrementals run to the
disk volumes on babylon4-sd.  Maximum concurrency is set to 25 on the
Director, 20 everywhere else.  babylon5 runs Gentoo Linux, babylon4 runs
Solaris 10.

I ran my last Full backups on August 3.  I last rebooted this machine
August 6, and restarted the Director on babylon4 afterwards.  All jobs
since then have run on babylon4-sd, with no problems, up to and
including the differentials that ran early yesterday morning.

Yesterday morning, I could not start Firefox on babylon5 due to a
corrupted profile, and I had to use both babylon4-sd and babylon5-sd to
restore my Firefox profile from (to be safe) two days earlier.

Last night's scheduled incrementals did not run.  If I look at the
Director status right now, all seven jobs are marked as "waiting on max
Storage jobs".  There are no jobs running on EITHER Storage.  As far as
I can determine, they appear to be waiting on babylon5-sd, which has no
jobs running on it.  All daemons are running and responding.

Other users have reported running into this bug, but it has not been
reproducible on demand.  No solution has been found so far except to
restart the Director.  I had previously speculated that it occurred when
a Storage daemon has been restarted more recently than the Director, but
in this case, neither Storage daemon has been restarted in the last ten
days.  Neither Storage daemon is even in use at the moment, but the
Director thinks one or the other is at its job limit.

Storage daemon status:

status storage=babylon5-sd

babylon5-sd Version: 5.0.2 (28 April 2010) i686-pc-linux-gnu gentoo
Daemon started 06-Aug-10 20:29, 1 Job run since started.
 Heap: heap=970,752 smbytes=539,241 max_bytes=653,649 bufs=78 max_bufs=206
Sizes: boffset_t=8 size_t=4 int32_t=4 int64_t=8

Running Jobs:
No Jobs running.
====

Jobs waiting to reserve a drive:
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name
===================================================================
  2773  Full    157,512    1.239 G  OK       03-Aug-10 00:33
Spikyfishthing_Backup
  2778  Full    468,349    10.04 G  OK       03-Aug-10 01:55 Babylon5_Backup
  2774  Full     90,615    13.58 G  OK       03-Aug-10 02:12
Mabolgamp_Backup
  2775  Full     62,888    17.24 G  OK       03-Aug-10 02:34 Lake_Backup
  2771  Full    123,185    27.62 G  OK       03-Aug-10 03:23 Zocalo_Backup
  2772  Full     81,133    67.58 G  OK       03-Aug-10 04:28 Vorlon_Backup
  2777  Full    683,366    450.5 G  OK       03-Aug-10 17:42 Babylon4_Backup
  2788  Full          1    200.8 M  OK       03-Aug-10 18:11 Catalog_Backup
  2797                0         0   OK       04-Aug-10 14:12 Restore
  2887                0         0   OK       16-Aug-10 10:57 Restore
====

Device status:
Device "Ultrium-LTO2" (/dev/nst0) is not open.
    Device is BLOCKED. User unmounted.
====

Used Volume status:
====

====

status storage=babylon4-sd

babylon4-sd Version: 5.0.2 (28 April 2010) i386-pc-solaris2.10 solaris 5.10
Daemon started 06-Aug-10 21:30, 71 Jobs run since started.
 Heap: heap=1,196,032 smbytes=541,200 max_bytes=1,099,549 bufs=83
max_bufs=367
Sizes: boffset_t=8 size_t=4 int32_t=4 int64_t=8

Running Jobs:
No Jobs running.
====

Jobs waiting to reserve a drive:
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name
===================================================================
  2872  Incr        195    816.7 M  OK       15-Aug-10 03:50 Babylon4_Backup
  2878  Incr          1    197.5 M  OK       15-Aug-10 09:43 Catalog_Backup
  2882  Diff        668    407.9 M  OK       16-Aug-10 03:40 Lake_Backup
  2881  Diff    103,886    1.902 G  OK       16-Aug-10 03:40 Babylon5_Backup
  2883  Diff     10,890    891.8 M  OK       16-Aug-10 03:45
Mabolgamp_Backup
  2885  Diff      7,772    3.836 G  OK       16-Aug-10 03:52 Zocalo_Backup
  2884  Diff      4,856    10.40 G  OK       16-Aug-10 04:01 Vorlon_Backup
  2880  Diff     21,823    32.93 G  OK       16-Aug-10 04:43 Babylon4_Backup
  2886  Diff          1    202.6 M  OK       16-Aug-10 09:43 Catalog_Backup
  2887                0         0   OK       16-Aug-10 10:59 Restore
====

Device status:
Device "FileStorage" (/spool/bacula) is not open.
====

Used Volume status:
====


Director status:

status dir
babylon4-dir Version: 5.0.2 (28 April 2010) i386-pc-solaris2.10 solaris 5.10
Daemon started 06-Aug-10 20:31, 74 Jobs run since started.
 Heap: heap=43,139,072 smbytes=138,175 max_bytes=41,103,221 bufs=760
max_bufs=954
Incremental     Backup           15     17-Aug-10 09:40         Catalog Backup  
        ARCH-0001
Incremental     Backup           10     18-Aug-10 03:30         Babylon4 Backup 
        ARCH-0001
Incremental     Backup           10     18-Aug-10 03:30         Babylon5 Backup 
        ARCH-0001
Incremental     Backup           10     18-Aug-10 03:30         Lake Backup     
        ARCH-0001
Incremental     Backup           10     18-Aug-10 03:30         Mabolgamp 
Backup        ARCH-0001
Incremental     Backup           10     18-Aug-10 03:30         Vorlon Backup   
        ARCH-0001
Incremental     Backup           10     18-Aug-10 03:30         Zocalo Backup   
        ARCH-0001
  2888  Increme Babylon4_Backup.2010-08-17_03.30.00_22  is waiting on max
Storage jobs
  2889  Increme Babylon5_Backup.2010-08-17_03.30.01_23  is waiting on max
Storage jobs
  2890  Increme Lake_Backup.2010-08-17_03.30.01_24      is waiting on max
Storage jobs
  2891  Increme Mabolgamp_Backup.2010-08-17_03.30.01_25 is waiting on max
Storage jobs
  2892  Increme Vorlon_Backup.2010-08-17_03.30.01_26    is waiting on max
Storage jobs
  2893  Increme Zocalo_Backup.2010-08-17_03.30.01_27    is waiting on max
Storage jobs
  2878  Incr           1           197.5 M      OK      15-Aug-10 09:43 
Catalog_Backup
  2879                 0                0       OK      16-Aug-10 03:10
Clean_Expired_Volumes
  2882  Diff         668           407.8 M      OK      16-Aug-10 03:41 
Lake_Backup
  2881  Diff     103,886           1.888 G      OK      16-Aug-10 03:44 
Babylon5_Backup
  2883  Diff      10,890           889.2 M      OK      16-Aug-10 03:45 
Mabolgamp_Backup
  2885  Diff       7,772           3.834 G      OK      16-Aug-10 03:54 
Zocalo_Backup
  2884  Diff       4,856           10.40 G      OK      16-Aug-10 04:02 
Vorlon_Backup
  2880  Diff      21,823           32.93 G      OK      16-Aug-10 04:43 
Babylon4_Backup
  2886  Diff           1           202.6 M      OK      16-Aug-10 09:43 
Catalog_Backup
  2887             3,678           305.2 M      OK      16-Aug-10 10:59 Restore
====


Observe that yesterday morning's Catalog backup, which was supposed to
run to a disk volume on babylon4-sd, is actually queued for volume
ARCH-0001, an LTO2 volume on babylon5-sd.  It was queued to start at
0940, about the time I was running my restore, which used volume
ARCH-0010 on babylon5-sd as well as several disk volumes on babylon4-sd.
 The relevant resources:

Schedule {
  Name = "Monthly Rotation"
  Run = Full 1st mon at 03:30
  Run = Differential 2nd-5th mon at 03:30
  Run = Incremental tue-sun at 03:30
}

Schedule {
  Name = "Catalog"
  Run = Full 1st mon at 09:40
  Run = Differential 2nd-5th mon at 09:40
  Run = Incremental tue-sun at 09:40
}

Storage {
  Name = babylon4-sd
  Address = babylon4.babcom.com
  SDPort = 9103
  Password = "xxxxxxxxxxxxx"
  Device = FileStorage
  Media Type = File
  Maximum Concurrent Jobs = 20
}

Storage {
  Name = babylon5-sd
  Address = babylon5.babcom.com
  SDPort = 9103
  Password = "xxxxxxxxxxxxx"
  Device = Ultrium-LTO2
  Media Type = LTO2
  Maximum Concurrent Jobs = 20
}

JobDefs {
  Name = Backup
  Type = Backup
  Level = Full
  Pool = Full-Tape
  Full Backup Pool = Full-Tape
  Differential Backup Pool = Diff-Disk
  Incremental Backup Pool = Incr-Disk
  Schedule = "Monthly Rotation"
  RunBeforeJob = "/opt/bacula/sbin/checkhost %c"
  Write Bootstrap = "/opt/bacula/working/%c.bsr"
  Rerun Failed Levels = yes
  Messages = Daemon
  Priority = 10
  Allow Duplicate Jobs = no
  Cancel Queued Duplicates = yes
}

Job {
  Name = "Catalog Backup"
  JobDefs = Backup
  FileSet = Catalog
  Schedule = "Catalog"
  Client = babylon4
  RunBeforeJob = "/opt/bacula/etc/dump_catalog"
  RunAfterJob = "/opt/bacula/etc/delete_catalog_dump"
  Write Bootstrap = /opt/bacula/working/CatalogBackup.bsr
  Priority = 15
}


It appears - speculating here - that the Director has somehow become
confused, and because a Restore was running at the time using the
Full-Tape pool on babylon5-sd, mistakenly queued the Catalog backup -
and all subsequent incrementals - that were SUPPOSED to run to the
Incr-Disk pool on babylon4-sd, to the Full-Tape pool on babylon5-sd
instead.  Yet it's not requesting a volume mount for ARCH-0001 on
babylon5-sd; it apparently thinks that babylon5-sd is already at its
20-job concurrency limit, even though no job is actually running on
either babylon5-sd or babylon4-sd right now, and no jobs have in fact
run on either since yesterday morning.

In short, something is really screwed up here.


How can I best capture the state of the daemons so that we can find out
what's causing this bug?


-- 
  Phil Stracchino, CDK#2     DoD#299792458     ICBM: 43.5607, -71.355
  ala...@caerllewys.net   ala...@metrocast.net   p...@co.ordinate.org
         Renaissance Man, Unix ronin, Perl hacker, Free Stater
                 It's not the years, it's the mileage.

------------------------------------------------------------------------------
This SF.net email is sponsored by 

Make an app they can't live without
Enter the BlackBerry Developer Challenge
http://p.sf.net/sfu/RIM-dev2dev 
_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Reply via email to