Currently all of my machines, except for one client, are running Bacula 5.0.2. (One Windows 2000 machine still has a 3.0.1 client.) I have two Storage daemons, one on my main server on its disk array (babylon4-sd), one on this machine on an LTO2 drive (babylon5-sd). The Director and catalog are on babylon4. Monthly Full backups run to the LTO2 drive on babylon5-sd. Weekly differentials and nightly incrementals run to the disk volumes on babylon4-sd. Maximum concurrency is set to 25 on the Director, 20 everywhere else. babylon5 runs Gentoo Linux, babylon4 runs Solaris 10.
I ran my last Full backups on August 3. I last rebooted this machine August 6, and restarted the Director on babylon4 afterwards. All jobs since then have run on babylon4-sd, with no problems, up to and including the differentials that ran early yesterday morning. Yesterday morning, I could not start Firefox on babylon5 due to a corrupted profile, and I had to use both babylon4-sd and babylon5-sd to restore my Firefox profile from (to be safe) two days earlier. Last night's scheduled incrementals did not run. If I look at the Director status right now, all seven jobs are marked as "waiting on max Storage jobs". There are no jobs running on EITHER Storage. As far as I can determine, they appear to be waiting on babylon5-sd, which has no jobs running on it. All daemons are running and responding. Other users have reported running into this bug, but it has not been reproducible on demand. No solution has been found so far except to restart the Director. I had previously speculated that it occurred when a Storage daemon has been restarted more recently than the Director, but in this case, neither Storage daemon has been restarted in the last ten days. Neither Storage daemon is even in use at the moment, but the Director thinks one or the other is at its job limit. Storage daemon status: status storage=babylon5-sd babylon5-sd Version: 5.0.2 (28 April 2010) i686-pc-linux-gnu gentoo Daemon started 06-Aug-10 20:29, 1 Job run since started. Heap: heap=970,752 smbytes=539,241 max_bytes=653,649 bufs=78 max_bufs=206 Sizes: boffset_t=8 size_t=4 int32_t=4 int64_t=8 Running Jobs: No Jobs running. ==== Jobs waiting to reserve a drive: ==== Terminated Jobs: JobId Level Files Bytes Status Finished Name =================================================================== 2773 Full 157,512 1.239 G OK 03-Aug-10 00:33 Spikyfishthing_Backup 2778 Full 468,349 10.04 G OK 03-Aug-10 01:55 Babylon5_Backup 2774 Full 90,615 13.58 G OK 03-Aug-10 02:12 Mabolgamp_Backup 2775 Full 62,888 17.24 G OK 03-Aug-10 02:34 Lake_Backup 2771 Full 123,185 27.62 G OK 03-Aug-10 03:23 Zocalo_Backup 2772 Full 81,133 67.58 G OK 03-Aug-10 04:28 Vorlon_Backup 2777 Full 683,366 450.5 G OK 03-Aug-10 17:42 Babylon4_Backup 2788 Full 1 200.8 M OK 03-Aug-10 18:11 Catalog_Backup 2797 0 0 OK 04-Aug-10 14:12 Restore 2887 0 0 OK 16-Aug-10 10:57 Restore ==== Device status: Device "Ultrium-LTO2" (/dev/nst0) is not open. Device is BLOCKED. User unmounted. ==== Used Volume status: ==== ==== status storage=babylon4-sd babylon4-sd Version: 5.0.2 (28 April 2010) i386-pc-solaris2.10 solaris 5.10 Daemon started 06-Aug-10 21:30, 71 Jobs run since started. Heap: heap=1,196,032 smbytes=541,200 max_bytes=1,099,549 bufs=83 max_bufs=367 Sizes: boffset_t=8 size_t=4 int32_t=4 int64_t=8 Running Jobs: No Jobs running. ==== Jobs waiting to reserve a drive: ==== Terminated Jobs: JobId Level Files Bytes Status Finished Name =================================================================== 2872 Incr 195 816.7 M OK 15-Aug-10 03:50 Babylon4_Backup 2878 Incr 1 197.5 M OK 15-Aug-10 09:43 Catalog_Backup 2882 Diff 668 407.9 M OK 16-Aug-10 03:40 Lake_Backup 2881 Diff 103,886 1.902 G OK 16-Aug-10 03:40 Babylon5_Backup 2883 Diff 10,890 891.8 M OK 16-Aug-10 03:45 Mabolgamp_Backup 2885 Diff 7,772 3.836 G OK 16-Aug-10 03:52 Zocalo_Backup 2884 Diff 4,856 10.40 G OK 16-Aug-10 04:01 Vorlon_Backup 2880 Diff 21,823 32.93 G OK 16-Aug-10 04:43 Babylon4_Backup 2886 Diff 1 202.6 M OK 16-Aug-10 09:43 Catalog_Backup 2887 0 0 OK 16-Aug-10 10:59 Restore ==== Device status: Device "FileStorage" (/spool/bacula) is not open. ==== Used Volume status: ==== Director status: status dir babylon4-dir Version: 5.0.2 (28 April 2010) i386-pc-solaris2.10 solaris 5.10 Daemon started 06-Aug-10 20:31, 74 Jobs run since started. Heap: heap=43,139,072 smbytes=138,175 max_bytes=41,103,221 bufs=760 max_bufs=954 Incremental Backup 15 17-Aug-10 09:40 Catalog Backup ARCH-0001 Incremental Backup 10 18-Aug-10 03:30 Babylon4 Backup ARCH-0001 Incremental Backup 10 18-Aug-10 03:30 Babylon5 Backup ARCH-0001 Incremental Backup 10 18-Aug-10 03:30 Lake Backup ARCH-0001 Incremental Backup 10 18-Aug-10 03:30 Mabolgamp Backup ARCH-0001 Incremental Backup 10 18-Aug-10 03:30 Vorlon Backup ARCH-0001 Incremental Backup 10 18-Aug-10 03:30 Zocalo Backup ARCH-0001 2888 Increme Babylon4_Backup.2010-08-17_03.30.00_22 is waiting on max Storage jobs 2889 Increme Babylon5_Backup.2010-08-17_03.30.01_23 is waiting on max Storage jobs 2890 Increme Lake_Backup.2010-08-17_03.30.01_24 is waiting on max Storage jobs 2891 Increme Mabolgamp_Backup.2010-08-17_03.30.01_25 is waiting on max Storage jobs 2892 Increme Vorlon_Backup.2010-08-17_03.30.01_26 is waiting on max Storage jobs 2893 Increme Zocalo_Backup.2010-08-17_03.30.01_27 is waiting on max Storage jobs 2878 Incr 1 197.5 M OK 15-Aug-10 09:43 Catalog_Backup 2879 0 0 OK 16-Aug-10 03:10 Clean_Expired_Volumes 2882 Diff 668 407.8 M OK 16-Aug-10 03:41 Lake_Backup 2881 Diff 103,886 1.888 G OK 16-Aug-10 03:44 Babylon5_Backup 2883 Diff 10,890 889.2 M OK 16-Aug-10 03:45 Mabolgamp_Backup 2885 Diff 7,772 3.834 G OK 16-Aug-10 03:54 Zocalo_Backup 2884 Diff 4,856 10.40 G OK 16-Aug-10 04:02 Vorlon_Backup 2880 Diff 21,823 32.93 G OK 16-Aug-10 04:43 Babylon4_Backup 2886 Diff 1 202.6 M OK 16-Aug-10 09:43 Catalog_Backup 2887 3,678 305.2 M OK 16-Aug-10 10:59 Restore ==== Observe that yesterday morning's Catalog backup, which was supposed to run to a disk volume on babylon4-sd, is actually queued for volume ARCH-0001, an LTO2 volume on babylon5-sd. It was queued to start at 0940, about the time I was running my restore, which used volume ARCH-0010 on babylon5-sd as well as several disk volumes on babylon4-sd. The relevant resources: Schedule { Name = "Monthly Rotation" Run = Full 1st mon at 03:30 Run = Differential 2nd-5th mon at 03:30 Run = Incremental tue-sun at 03:30 } Schedule { Name = "Catalog" Run = Full 1st mon at 09:40 Run = Differential 2nd-5th mon at 09:40 Run = Incremental tue-sun at 09:40 } Storage { Name = babylon4-sd Address = babylon4.babcom.com SDPort = 9103 Password = "xxxxxxxxxxxxx" Device = FileStorage Media Type = File Maximum Concurrent Jobs = 20 } Storage { Name = babylon5-sd Address = babylon5.babcom.com SDPort = 9103 Password = "xxxxxxxxxxxxx" Device = Ultrium-LTO2 Media Type = LTO2 Maximum Concurrent Jobs = 20 } JobDefs { Name = Backup Type = Backup Level = Full Pool = Full-Tape Full Backup Pool = Full-Tape Differential Backup Pool = Diff-Disk Incremental Backup Pool = Incr-Disk Schedule = "Monthly Rotation" RunBeforeJob = "/opt/bacula/sbin/checkhost %c" Write Bootstrap = "/opt/bacula/working/%c.bsr" Rerun Failed Levels = yes Messages = Daemon Priority = 10 Allow Duplicate Jobs = no Cancel Queued Duplicates = yes } Job { Name = "Catalog Backup" JobDefs = Backup FileSet = Catalog Schedule = "Catalog" Client = babylon4 RunBeforeJob = "/opt/bacula/etc/dump_catalog" RunAfterJob = "/opt/bacula/etc/delete_catalog_dump" Write Bootstrap = /opt/bacula/working/CatalogBackup.bsr Priority = 15 } It appears - speculating here - that the Director has somehow become confused, and because a Restore was running at the time using the Full-Tape pool on babylon5-sd, mistakenly queued the Catalog backup - and all subsequent incrementals - that were SUPPOSED to run to the Incr-Disk pool on babylon4-sd, to the Full-Tape pool on babylon5-sd instead. Yet it's not requesting a volume mount for ARCH-0001 on babylon5-sd; it apparently thinks that babylon5-sd is already at its 20-job concurrency limit, even though no job is actually running on either babylon5-sd or babylon4-sd right now, and no jobs have in fact run on either since yesterday morning. In short, something is really screwed up here. How can I best capture the state of the daemons so that we can find out what's causing this bug? -- Phil Stracchino, CDK#2 DoD#299792458 ICBM: 43.5607, -71.355 ala...@caerllewys.net ala...@metrocast.net p...@co.ordinate.org Renaissance Man, Unix ronin, Perl hacker, Free Stater It's not the years, it's the mileage. ------------------------------------------------------------------------------ This SF.net email is sponsored by Make an app they can't live without Enter the BlackBerry Developer Challenge http://p.sf.net/sfu/RIM-dev2dev _______________________________________________ Bacula-devel mailing list Bacula-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-devel