This document contains the technical details of Bug #935. Bacula bug #935 reports that, during a restore, a large number of files are missing and thus not restored. This is quite surprising because we have a fairly extensive regression test suite that explicitly tests for this kind of problem many times.
Despite our testing, there is indeed a bug in Bacula that has the following characteristics:

1. It happens only when multiple simultaneous Jobs are run (regardless of whether or not data spooling is enabled), and only when the Storage daemon is changing from one Volume to another -- i.e. the backups span multiple Volumes.
2. It has been observed only on disk-based backups, not on tape.
3. Under the right circumstances (timing), it could and probably does happen on tape backups.
4. It seems to be timing dependent and requires multiple clients to reproduce, although under the right circumstances it should be reproducible with a single client doing multiple simultaneous backups.
5. Analysis indicates that it happens most often when the clients are slow (e.g. doing Incremental backups).
6. It has been verified to exist in versions 2.0.x and 2.2.x.
7. It should also be in version 1.38, but could not be reproduced in testing, perhaps due to timing considerations or the fact that the test FD daemons were version 2.2.2.
8. The data is correctly stored on the Volume, but incorrect index (JobMedia) records are stored in the database (the JobMedia record generated during the Volume change contains the index of the new Volume rather than the previous Volume). This is described in more detail below.
9. You can prevent the problem from occurring either by turning off multiple simultaneous Jobs or by ensuring that, while running multiple simultaneous Jobs, those Jobs do not span Volumes. E.g. you could manually mark Volumes as full when they are sufficiently large.
10. If you are not running multiple simultaneous Jobs, you will not be affected by this bug.
11. If you are running multiple simultaneous Jobs to tapes, I believe there is a reasonable probability that this problem could show up when Jobs are split across tapes.
12.
If you are running multiple simultaneous Jobs to disk, I believe there is a high probability that this problem will show up when Jobs are split across disk Volumes.

===============================

The problem comes from the fact that when the end of a Volume is reached, the SD must generate a JobMedia (index) record for each of the Jobs that is currently running. Since each Job runs in a separate thread, the thread that does the Volume switch marks all the other threads (Jobs) with a flag that tells them to update the catalog index (JobMedia). Sometime later, when such a thread attempts another write to the Volume, it will create a JobMedia record.

This all works perfectly well, with one silly exception that was overlooked: the Volume Id (MediaId) may have changed by the time the Job gets around to generating the JobMedia record, and the code in all versions prior to 2.2.3 used a "global" Volume index, which could point to the new Volume rather than the old one. This doesn't seem to happen very often, otherwise it would have shown up in our regression tests. I suspect that it happens when the SD is very busy (has lots of clients feeding it), because at that point the Volume can be changed before some of the Jobs get around to trying to write again.

The result of the above is that if Bacula writes to Tape001, then switches to writing Tape002, any Job that does not update its index (JobMedia record) before the new Volume starts to be written will have its last group of files marked as being on Volume Tape002 rather than Tape001, and when the restore runs, it obviously will not find those files.

The solution to the problem was quite simple: when the SD does its record keeping (saving the current file locations) into a local copy, it must also save the current Volume Id; then, when the JobMedia record is written, it will point to the right Volume.
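The race and the fix can be modeled with a short sketch. This is purely illustrative Python, not Bacula's actual C code; the names (Device, Job, JobMedia, record_files) are hypothetical and only model the logic described above:

```python
# Sketch of the JobMedia race: a Job snapshots its file locations, another
# thread switches Volumes, then the Job writes its JobMedia record.
from dataclasses import dataclass

@dataclass
class Device:
    media_id: int = 1          # the "global" current Volume index in the SD

@dataclass
class JobMedia:
    media_id: int
    first_index: int
    last_index: int

class Job:
    def __init__(self, dev):
        self.dev = dev
        self.saved_media_id = None   # the fix: snapshot the Volume Id too
        self.first_index = 1
        self.last_index = 0

    def record_files(self, last_index):
        # SD record keeping: save current file locations into a local copy,
        # and (the 2.2.3 fix) the Volume Id they were written on.
        self.last_index = last_index
        self.saved_media_id = self.dev.media_id

    def write_jobmedia_buggy(self):
        # Pre-2.2.3 behaviour: read the *current* global Volume index,
        # which may already point at the new Volume.
        return JobMedia(self.dev.media_id, self.first_index, self.last_index)

    def write_jobmedia_fixed(self):
        # 2.2.3 behaviour: use the Volume Id saved with the file locations.
        return JobMedia(self.saved_media_id, self.first_index, self.last_index)

dev = Device()
job = Job(dev)
job.record_files(last_index=4324)   # files actually written on Volume 1
dev.media_id = 2                    # another thread switches Volumes first
print(job.write_jobmedia_buggy().media_id)  # 2 -> wrong Volume in catalog
print(job.write_jobmedia_fixed().media_id)  # 1 -> correct
```

The snapshot of `media_id` taken alongside the file locations is exactly what makes the later, delayed JobMedia write safe.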
Perhaps with time we will have a better idea how to detect and correct this problem, but for the moment I can only show you some examples of failed and successful jobs and give some indications of how to "manually" recover the files. First the output showing a failure:

===================== Failure ==========================
10-Sep 08:21 rufus-dir: Start Backup JobId 1, Job=roxie.2007-09-10_08.21.21
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-sd: Wrote label to prelabeled Volume "TestVolume001" on device "FileStorage1" (/home/kern/bacula/xregress/tmp)
10-Sep 08:21 rufus-dir: Start Backup JobId 2, Job=lmatou.2007-09-10_08.21.22
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-dir: Start Backup JobId 3, Job=Multi-client.2007-09-10_08.21.23
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-sd: Job write elapsed time = 00:00:23, Transfer rate = 5.572 M bytes/second
10-Sep 08:21 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:21:47
  Build OS: i686-pc-linux-gnu redhat
  JobId: 1
  Job: roxie.2007-09-10_08.21.21
  Backup Level: Full
  Client: "roxie-fd" 2.2.1 (30Aug07) i686-pc-linux-gnu,suse,10.2
  FileSet: "roxie-FileSet" 2007-09-10 08:21:21
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 08:21:21
  Start time: 10-Sep-2007 08:21:22
  End time: 10-Sep-2007 08:21:47
  Elapsed time: 25 secs
  Priority: 10
  FD Files Written: 7,400
  SD Files Written: 7,400
  FD Bytes Written: 127,110,275 (127.1 MB)
  SD Bytes Written: 128,178,908 (128.1 MB)
  Rate: 5084.4 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001
  Volume Session Id: 1
  Volume Session Time: 1189405277
  Last Volume Bytes: 302,945,847 (302.9 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 08:21 rufus-dir: Begin pruning Jobs.
10-Sep 08:21 rufus-dir: No Jobs found to prune.
10-Sep 08:21 rufus-dir: Begin pruning Files.
10-Sep 08:21 rufus-dir: No Files found to prune.
10-Sep 08:21 rufus-dir: End auto prune.
10-Sep 08:22 rufus-sd: User defined maximum volume capacity 900,000,000 exceeded on device "FileStorage1" (/home/kern/bacula/xregress/tmp).
10-Sep 08:22 rufus-sd: End of medium on Volume "TestVolume001" Bytes=899,939,860 Blocks=13,950 at 10-Sep-2007 08:22.
10-Sep 08:22 rufus-sd: Wrote label to prelabeled Volume "TestVolume002" on device "FileStorage1" (/home/kern/bacula/xregress/tmp)
10-Sep 08:22 rufus-sd: New volume "TestVolume002" mounted on device "FileStorage1" (/home/kern/bacula/xregress/tmp) at 10-Sep-2007 08:22.
10-Sep 08:22 rufus-sd: Job write elapsed time = 00:01:20, Transfer rate = 7.982 M bytes/second
10-Sep 08:22 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:22:51
  Build OS: i686-pc-linux-gnu redhat
  JobId: 2
  Job: lmatou.2007-09-10_08.21.22
  Backup Level: Full
  Client: "lmatou-fd" 2.2.0 (08Aug07) i686-pc-linux-gnu,redhat,
  FileSet: "lmatou-FileSet" 2007-09-10 08:21:22
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 08:21:21
  Start time: 10-Sep-2007 08:21:23
  End time: 10-Sep-2007 08:22:51
  Elapsed time: 1 min 28 secs
  Priority: 10
  FD Files Written: 15,236
  SD Files Written: 15,236
  FD Bytes Written: 636,426,379 (636.4 MB)
  SD Bytes Written: 638,628,613 (638.6 MB)
  Rate: 7232.1 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001|TestVolume002
  Volume Session Id: 2
  Volume Session Time: 1189405277
  Last Volume Bytes: 222,311,478 (222.3 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 08:22 rufus-dir: Begin pruning Jobs.
10-Sep 08:22 rufus-dir: No Jobs found to prune.
10-Sep 08:22 rufus-dir: Begin pruning Files.
10-Sep 08:22 rufus-dir: No Files found to prune.
10-Sep 08:22 rufus-dir: End auto prune.
10-Sep 08:23 rufus-sd: Job write elapsed time = 00:01:33, Transfer rate = 4.786 M bytes/second
10-Sep 08:23 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:23:02
  Build OS: i686-pc-linux-gnu redhat
  JobId: 3
  Job: Multi-client.2007-09-10_08.21.23
  Backup Level: Full
  Client: "rufus-fd" 2.2.2 (06Aug07) i686-pc-linux-gnu,redhat,
  FileSet: "Full Set" 2007-09-10 08:21:23
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 08:21:22
  Start time: 10-Sep-2007 08:21:27
  End time: 10-Sep-2007 08:23:02
  Elapsed time: 1 min 35 secs
  Priority: 10
  FD Files Written: 6,696
  SD Files Written: 6,696
  FD Bytes Written: 444,252,164 (444.2 MB)
  SD Bytes Written: 445,181,110 (445.1 MB)
  Rate: 4676.3 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume002
  Volume Session Id: 3
  Volume Session Time: 1189405277
  Last Volume Bytes: 313,858,029 (313.8 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 08:23 rufus-dir: Begin pruning Jobs.
10-Sep 08:23 rufus-dir: No Jobs found to prune.
10-Sep 08:23 rufus-dir: Begin pruning Files.
10-Sep 08:23 rufus-dir: No Files found to prune.
10-Sep 08:23 rufus-dir: End auto prune.
@#
@# now do a restore
@#
@tee tmp/log2.out
restore where=/home/kern/bacula/xregress/tmp/bacula-restores client=rufus-fd select storage=File1
Automatically selected FileSet: Full Set
+-------+-------+----------+-------------+---------------------+---------------+
| JobId | Level | JobFiles | JobBytes    | StartTime           | VolumeName    |
+-------+-------+----------+-------------+---------------------+---------------+
| 3     | F     | 6,696    | 444,252,164 | 2007-09-10 08:21:27 | TestVolume002 |
+-------+-------+----------+-------------+---------------------+---------------+
You have selected the following JobId: 3
Building directory tree for JobId 3 ...
+++++++++++++++++++++++++++++++++++++++++++++
1 Job, 6,078 files inserted into the tree.
You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.
cwd is: /
$ unmark *
6,696 files unmarked.
$ mark *
6,696 files marked.
$ done
Bootstrap records written to /home/kern/bacula/xregress/working/rufus-dir.restore.1.bsr
The job will require the following
   Volume(s)                 Storage(s)                SD Device(s)
===========================================================================
   TestVolume002             File1                     FileStorage1
6,696 files selected to be restored.
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /home/kern/bacula/xregress/working/rufus-dir.restore.1.bsr
Where:           /home/kern/bacula/xregress/tmp/bacula-restores
Replace:         always
FileSet:         Full Set
Backup Client:   rufus-fd
Restore Client:  rufus-fd
Storage:         File1
When:            2007-09-10 08:23:03
Catalog:         MyCatalog
Priority:        10
OK to run? (yes/mod/no): yes
Job queued. JobId=4
wait
You have messages.
messages
10-Sep 08:23 rufus-dir: Start Restore Job RestoreFiles.2007-09-10_08.23.03
10-Sep 08:23 rufus-dir: Using Device "FileStorage1"
10-Sep 08:23 rufus-sd: Ready to read from volume "TestVolume002" on device "FileStorage1" (/home/kern/bacula/xregress/tmp).
10-Sep 08:23 rufus-sd: Forward spacing Volume "TestVolume002" to file:block 0:64724.
10-Sep 08:23 rufus-fd: RestoreFiles.2007-09-10_08.23.03 Warning: Cannot change owner and/or group of /home/kern/bacula/xregress/tmp/bacula-restores/home/kern/bacula/k/src/win32/stored/: ERR=Operation not permitted
10-Sep 08:23 rufus-fd: RestoreFiles.2007-09-10_08.23.03 Error: attribs.c:442 Unable to set file owner /home/kern/bacula/xregress/tmp/bacula-restores/home/kern/bacula/k/: ERR=Operation not permitted
[ ... snip ...
]
10-Sep 08:23 rufus-dir: RestoreFiles.2007-09-10_08.23.03 Error: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:23:23
  Build OS: i686-pc-linux-gnu redhat
  JobId: 4
  Job: RestoreFiles.2007-09-10_08.23.03
  Restore Client: rufus-fd
  Start time: 10-Sep-2007 08:23:05
  End time: 10-Sep-2007 08:23:23
  Files Expected: 6,696
  Files Restored: 2,525
  Bytes Restored: 188,395,392
  Rate: 10466.4 KB/s
  FD Errors: 240
  FD termination status: Error
  SD termination status: OK
  Termination: *** Restore Error ***
10-Sep 08:23 rufus-dir: Begin pruning Jobs.
10-Sep 08:23 rufus-dir: No Jobs found to prune.
10-Sep 08:23 rufus-dir: Begin pruning Files.
10-Sep 08:23 rufus-dir: No Files found to prune.
10-Sep 08:23 rufus-dir: End auto prune.
!!!!! multi-client-test failed!!! !!!!!
!!!!! Bad Job termination status  !!!!!
bstat=0 rstat=1 dstat=0
================= end failure example ==================

Now what happened above is that there were three backup jobs run. JobId 3, Multi-client.2007-09-10_08.21.23, shows a successful backup to TestVolume002. However, it *actually* wrote to TestVolume001 and TestVolume002. When the restore is done, only TestVolume002 appears in the output, and only the files that were on TestVolume002 were restored. This is because *all* the JobMedia records indicate they were written on TestVolume002.
Now the output from the same test that succeeds (after the patch is applied) looks like:

==================== begin successful backup/restore =============
10-Sep 12:37 rufus-dir: Start Backup JobId 1, Job=roxie.2007-09-10_12.37.07
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-sd: Wrote label to prelabeled Volume "TestVolume001" on device "FileStorage1" (/home/kern/bacula/regress/tmp)
10-Sep 12:37 rufus-dir: Start Backup JobId 2, Job=lmatou.2007-09-10_12.37.08
10-Sep 12:37 rufus-dir: Start Backup JobId 3, Job=Multi-client.2007-09-10_12.37.09
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-sd: Job write elapsed time = 00:00:25, Transfer rate = 5.127 M bytes/second
10-Sep 12:37 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:37:35
  Build OS: i686-pc-linux-gnu redhat
  JobId: 1
  Job: roxie.2007-09-10_12.37.07
  Backup Level: Full
  Client: "roxie-fd" 2.2.1 (30Aug07) i686-pc-linux-gnu,suse,10.2
  FileSet: "roxie-FileSet" 2007-09-10 12:37:07
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 12:37:06
  Start time: 10-Sep-2007 12:37:08
  End time: 10-Sep-2007 12:37:35
  Elapsed time: 27 secs
  Priority: 10
  FD Files Written: 7,400
  SD Files Written: 7,400
  FD Bytes Written: 127,110,275 (127.1 MB)
  SD Bytes Written: 128,178,908 (128.1 MB)
  Rate: 4707.8 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001
  Volume Session Id: 1
  Volume Session Time: 1189420623
  Last Volume Bytes: 363,458,103 (363.4 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 12:37 rufus-dir: Begin pruning Jobs.
10-Sep 12:37 rufus-dir: No Jobs found to prune.
10-Sep 12:37 rufus-dir: Begin pruning Files.
10-Sep 12:37 rufus-dir: No Files found to prune.
10-Sep 12:37 rufus-dir: End auto prune.
10-Sep 12:38 rufus-sd: User defined maximum volume capacity 900,000,000 exceeded on device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:38 rufus-sd: End of medium on Volume "TestVolume001" Bytes=899,939,867 Blocks=13,950 at 10-Sep-2007 12:38.
10-Sep 12:38 rufus-sd: Wrote label to prelabeled Volume "TestVolume002" on device "FileStorage1" (/home/kern/bacula/regress/tmp)
10-Sep 12:38 rufus-sd: New volume "TestVolume002" mounted on device "FileStorage1" (/home/kern/bacula/regress/tmp) at 10-Sep-2007 12:38.
10-Sep 12:38 rufus-sd: Job write elapsed time = 00:01:24, Transfer rate = 7.602 M bytes/second
10-Sep 12:38 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:38:40
  Build OS: i686-pc-linux-gnu redhat
  JobId: 2
  Job: lmatou.2007-09-10_12.37.08
  Backup Level: Full
  Client: "lmatou-fd" 2.2.0 (08Aug07) i686-pc-linux-gnu,redhat,
  FileSet: "lmatou-FileSet" 2007-09-10 12:37:08
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 12:37:07
  Start time: 10-Sep-2007 12:37:09
  End time: 10-Sep-2007 12:38:40
  Elapsed time: 1 min 31 secs
  Priority: 10
  FD Files Written: 15,236
  SD Files Written: 15,236
  FD Bytes Written: 636,426,379 (636.4 MB)
  SD Bytes Written: 638,628,613 (638.6 MB)
  Rate: 6993.7 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001|TestVolume002
  Volume Session Id: 2
  Volume Session Time: 1189420623
  Last Volume Bytes: 274,050,095 (274.0 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 12:38 rufus-dir: Begin pruning Jobs.
10-Sep 12:38 rufus-dir: No Jobs found to prune.
10-Sep 12:38 rufus-dir: Begin pruning Files.
10-Sep 12:38 rufus-dir: No Files found to prune.
10-Sep 12:38 rufus-dir: End auto prune.
10-Sep 12:38 rufus-sd: Job write elapsed time = 00:01:38, Transfer rate = 4.542 M bytes/second
10-Sep 12:38 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:38:48
  Build OS: i686-pc-linux-gnu redhat
  JobId: 3
  Job: Multi-client.2007-09-10_12.37.09
  Backup Level: Full
  Client: "rufus-fd" 2.3.3 (06Sep07) i686-pc-linux-gnu,redhat,
  FileSet: "Full Set" 2007-09-10 12:37:09
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 12:37:08
  Start time: 10-Sep-2007 12:37:09
  End time: 10-Sep-2007 12:38:48
  Elapsed time: 1 min 39 secs
  Priority: 10
  FD Files Written: 6,696
  SD Files Written: 6,696
  FD Bytes Written: 444,252,164 (444.2 MB)
  SD Bytes Written: 445,181,110 (445.1 MB)
  Rate: 4487.4 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001|TestVolume002
  Volume Session Id: 3
  Volume Session Time: 1189420623
  Last Volume Bytes: 313,858,022 (313.8 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 12:38 rufus-dir: Begin pruning Jobs.
10-Sep 12:38 rufus-dir: No Jobs found to prune.
10-Sep 12:38 rufus-dir: Begin pruning Files.
10-Sep 12:38 rufus-dir: No Files found to prune.
10-Sep 12:38 rufus-dir: End auto prune.
@#
@# now do a restore
@#
@tee tmp/log2.out
restore where=/home/kern/bacula/regress/tmp/bacula-restores client=rufus-fd select storage=File1
Automatically selected FileSet: Full Set
+-------+-------+----------+-------------+---------------------+---------------+
| JobId | Level | JobFiles | JobBytes    | StartTime           | VolumeName    |
+-------+-------+----------+-------------+---------------------+---------------+
| 3     | F     | 6,696    | 444,252,164 | 2007-09-10 12:37:09 | TestVolume001 |
| 3     | F     | 6,696    | 444,252,164 | 2007-09-10 12:37:09 | TestVolume002 |
+-------+-------+----------+-------------+---------------------+---------------+
You have selected the following JobId: 3
Building directory tree for JobId 3 ...
+++++++++++++++++++++++++++++++++++++++++++++
1 Job, 6,078 files inserted into the tree.
You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.
cwd is: /
$ unmark *
6,696 files unmarked.
$ mark *
6,696 files marked.
$ done
Bootstrap records written to /home/kern/bacula/regress/working/rufus-dir.restore.1.bsr
The job will require the following
   Volume(s)                 Storage(s)                SD Device(s)
===========================================================================
   TestVolume001             File1                     FileStorage1
   TestVolume002             File1                     FileStorage1
6,696 files selected to be restored.
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /home/kern/bacula/regress/working/rufus-dir.restore.1.bsr
Where:           /home/kern/bacula/regress/tmp/bacula-restores
Replace:         always
FileSet:         Full Set
Backup Client:   rufus-fd
Restore Client:  rufus-fd
Storage:         File1
When:            2007-09-10 12:38:49
Catalog:         MyCatalog
Priority:        10
OK to run? (yes/mod/no): yes
Job queued. JobId=4
wait
You have messages.
messages
10-Sep 12:38 rufus-dir: Start Restore Job RestoreFiles.2007-09-10_12.38.49
10-Sep 12:38 rufus-dir: Using Device "FileStorage1"
10-Sep 12:38 rufus-sd: Ready to read from volume "TestVolume001" on device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:38 rufus-sd: Forward spacing Volume "TestVolume001" to file:block 0:10644692.
10-Sep 12:39 rufus-sd: End of file 0 on device "FileStorage1" (/home/kern/bacula/regress/tmp), Volume "TestVolume001"
10-Sep 12:39 rufus-sd: End of Volume at file 0 on device "FileStorage1" (/home/kern/bacula/regress/tmp), Volume "TestVolume001"
10-Sep 12:39 rufus-sd: Ready to read from volume "TestVolume002" on device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:39 rufus-sd: Forward spacing Volume "TestVolume002" to file:block 0:212.
10-Sep 12:39 rufus-sd: End of file 0 on device "FileStorage1" (/home/kern/bacula/regress/tmp), Volume "TestVolume002"
10-Sep 12:39 rufus-sd: End of Volume at file 0 on device "FileStorage1" (/home/kern/bacula/regress/tmp), Volume "TestVolume002"
10-Sep 12:39 rufus-sd: End of all volumes.
10-Sep 12:39 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:39:52
  Build OS: i686-pc-linux-gnu redhat
  JobId: 4
  Job: RestoreFiles.2007-09-10_12.38.49
  Restore Client: rufus-fd
  Start time: 10-Sep-2007 12:38:51
  End time: 10-Sep-2007 12:39:52
  Files Expected: 6,696
  Files Restored: 6,696
  Bytes Restored: 444,252,164
  Rate: 7282.8 KB/s
  FD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Restore OK
10-Sep 12:39 rufus-dir: Begin pruning Jobs.
10-Sep 12:39 rufus-dir: No Jobs found to prune.
10-Sep 12:39 rufus-dir: Begin pruning Files.
10-Sep 12:39 rufus-dir: No Files found to prune.
10-Sep 12:39 rufus-dir: End auto prune.
==================== end successful backup/restore ==========

Now if we dump the JobMedia records for the Job that failed to restore properly, we get:

Enter SQL query: select * from JobMedia where JobId=3;
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| JobMediaId | JobId | MediaId | FirstIndex | LastIndex | StartFile | EndFile | StartBlock | EndBlock    | VolIndex | Copy | Stripe |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| 2          | 3     | 2       | 1          | 4,324     | 0         | 0       | 57,867,465 | 899,939,859 | 1        | 0    | 0      |
| 5          | 3     | 2       | 4,324      | 6,696     | 0         | 0       | 212        | 313,858,028 | 2        | 0    | 0      |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+

The key is to note that the MediaId is 2 for both of the records. That is perfectly legal, but in this case the first record was actually written on MediaId=1.
It was just incorrectly recorded. Below is an example from the same test that succeeded. Note that the first record has MediaId=1 and the second has MediaId=2. Some of the details of the start and end block numbers are different, but that just shows that the timing was different.

Enter SQL query: select * from JobMedia where JobId=3;
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| JobMediaId | JobId | MediaId | FirstIndex | LastIndex | StartFile | EndFile | StartBlock | EndBlock    | VolIndex | Copy | Stripe |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| 3          | 3     | 1       | 1          | 4,198     | 0         | 0       | 46,771,401 | 899,230,227 | 1        | 0    | 0      |
| 5          | 3     | 2       | 4,198      | 6,696     | 0         | 0       | 64,724     | 313,858,028 | 2        | 0    | 0      |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+

One aspect of this is to notice that if you sort the JobMedia records by JobMediaId, as they are in the listing, then for any given MediaId the records *must* have increasing StartBlock and EndBlock values, and they must not overlap. In the case of the successful Job, the MediaIds are different, so we cannot compare Start and End Block numbers; but in the case of the failed Job, we see that there are two JobMedia records for the same MediaId (perfectly OK), *but* the first one has a larger StartBlock than the second record, which is not possible.

Now obviously the simplest thing to do is to detect this condition and correct the MediaId. However, this is not always so easy, because the Job output does not *clearly* show which Volumes were written for which Job, though it can be deduced from the full output by knowing when Jobs start and stop and looking at the times when Bacula changes Volumes.
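The block-ordering invariant above can be checked mechanically. Here is a hedged sketch in Python that applies it to the columns shown in the listings; it uses an in-memory SQLite copy of the relevant JobMedia columns purely for illustration (against a real Bacula catalog you would query MySQL or PostgreSQL instead, and the helper name `find_suspect_jobs` is my own, not part of Bacula):

```python
import sqlite3

def find_suspect_jobs(rows):
    """rows: (JobMediaId, JobId, MediaId, StartBlock, EndBlock) tuples.
    Walks the records in JobMediaId order; for any given MediaId the
    StartBlock/EndBlock ranges must increase without overlapping, so a
    record that starts at or before the previous end marks a suspect Job."""
    suspects = set()
    last_end = {}  # (JobId, MediaId) -> EndBlock of the previous record
    for jmid, jobid, mediaid, sb, eb in sorted(rows):
        key = (jobid, mediaid)
        if key in last_end and sb <= last_end[key]:
            suspects.add(jobid)
        last_end[key] = eb
    return suspects

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE JobMedia (JobMediaId, JobId, MediaId,"
             " StartBlock, EndBlock)")
# The two records from the failed Job above: same MediaId, but the second
# record starts *before* the first one ends -- impossible on one Volume.
conn.executemany("INSERT INTO JobMedia VALUES (?,?,?,?,?)",
                 [(2, 3, 2, 57867465, 899939859),
                  (5, 3, 2, 212, 313858028)])
rows = conn.execute("SELECT * FROM JobMedia ORDER BY JobMediaId").fetchall()
print(find_suspect_jobs(rows))  # {3}
```

Run against the successful Job's records (MediaId 1 then 2), the same check returns an empty set, since records on different MediaIds are never compared.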
In addition, there may be a very large number of JobMedia records if it is a big Job.

Now for a look at the bsr files:

======= Unsuccessful bsr file ==================
# 10-Sep-2007 12:47:51 - Multi-client.2007-09-10_12.46.29 - Full
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189421182
VolFile=0-0
VolBlock=57867465-899939866
FileIndex=1-4337
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189421182
VolFile=0-0
VolBlock=64724-313858021
FileIndex=4337-6696
================================================

======== Successful bsr file ================
# 10-Sep-2007 12:38:48 - Multi-client.2007-09-10_12.37.09 - Full
Volume="TestVolume001"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189420623
VolFile=0-0
VolBlock=10644692-899939866
FileIndex=1-4344
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189420623
VolFile=0-0
VolBlock=212-313858021
FileIndex=4344-6696
============================================

What is key in looking at these two files is that the first, incorrect restore lists TestVolume002 twice, whereas the second, successful restore lists TestVolume001 and TestVolume002. So, if you understand the problem, you can also patch things up for the restore by manually editing the .bsr file when you get to the yes/mod/no prompt, before you answer yes. For example, in the unsuccessful restore bsr file shown above, if I modify the second line of the file to read:

Volume="TestVolume001"

then run the restore, it will work. Obviously this is much more complicated if there are a lot of Volumes involved ...

===

Another way of correcting the problem may be to run bscan. However, there are a number of aspects of bscanning that I don't like:

1. If you use the old bscan, it may fail the same way the original backup failed to write correct JobMedia records (I haven't tested it).
2. Bscanning is slow.
3.
Bscan does not restore *all* the info in the catalog, so you will be left with a catalog that is so-so.
4. Bscanning with a 2.2.3 version may work, but as of this moment it is untested.

If you want to go the bscan route, I recommend saving a copy of your database (just in case), but then instead of purging your database and recreating it with bscan, I would simply create a different database and, if that works, use it for restores. Finally, if bscan does work, it may be possible to purge just the Job that fails to restore correctly, then bscan only that Job (probably using a .bsr file). This would minimize the loss of information stored in your database. This has not been tried yet.

===

Before investigating bscan in more detail, I will release 2.2.3 and work on getting a patch for 2.0.x systems.

This document can be found at www.bacula.org/downloads/bug-395.txt and will be updated as the situation progresses.

Regards,

Kern