This document contains the technical details of Bug #935. Bacula bug #935 reports that, during a restore, a large number of files are missing and thus not restored. This is quite surprising because we have a fairly extensive regression test suite that explicitly tests for this kind of problem many times.
Despite our testing, there is indeed a bug in Bacula that has the following characteristics:

1. It happens only when multiple simultaneous Jobs are run (regardless of whether or not data spooling is enabled), and only when the Storage daemon is changing from one Volume to another -- i.e. the backups span multiple Volumes.
2. It has been observed only on disk-based backups, not on tape.
3. Under the right circumstances (timing), it could and probably does happen on tape backups.
4. It seems to be timing dependent and requires multiple clients to reproduce, although under the right circumstances it should be reproducible with a single client doing multiple simultaneous backups.
5. Analysis indicates that it happens most often when the clients are slow (e.g. doing Incremental backups).
6. It has been verified to exist in versions 2.0.x and 2.2.x.
7. It should also be in version 1.38, but could not be reproduced in testing, perhaps due to timing considerations or the fact that the test FD daemons were version 2.2.2.
8. The data is correctly stored on the Volume, but incorrect index (JobMedia) records are stored in the database (the JobMedia record generated during the Volume change contains the index of the new Volume rather than the previous Volume). This is described in more detail below.
9. You can prevent the problem from occurring either by turning off multiple simultaneous Jobs or by ensuring that, while running multiple simultaneous Jobs, those Jobs do not span Volumes. E.g. you could manually mark Volumes as full when they are sufficiently large.
10. If you are not running multiple simultaneous Jobs, you will not be affected by this bug.
11. If you are running multiple simultaneous Jobs to tapes, I believe there is a reasonable probability that this problem could show up when Jobs are split across tapes.
12.
If you are running multiple simultaneous Jobs to disk, I believe there is a high probability that this problem will show up when Jobs are split across disk Volumes.

===============================

The problem comes from the fact that when the end of a Volume is reached, the SD must generate a JobMedia (index) record for each of the Jobs that is currently running. Since each Job runs in a separate thread, the thread that does the Volume switch marks all the other threads (Jobs) with a flag that tells them to update the catalog index (JobMedia). Sometime later, when such a thread attempts another write to the Volume, it will create a JobMedia record.

This all works perfectly well, with one silly exception that was overlooked: the Volume Id (MediaId) may have changed by the time the Job gets around to generating the JobMedia record, and the code in all versions prior to 2.2.3 used a "global" Volume index, which could point to the new Volume rather than the old one. This doesn't seem to happen very often, otherwise it would have shown up in our regression tests. I suspect that it happens when the SD is very busy (has lots of clients feeding it), because at that point the Volume can be changed before some of the Jobs get around to trying to write again.

The result of the above is that if Bacula writes to Tape001, then switches to writing Tape002, any Job that does not update its index (JobMedia record) before the new Volume starts to be written will have its last group of files marked as being on Volume Tape002 rather than Tape001, and when the restore runs, it obviously will not find those files.

The solution to the problem was quite simple: when the SD does its record keeping (saving the current file locations) into a local copy, it must also save the current Volume Id; then, when the JobMedia record is written, it will point to the right Volume.
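The race and the fix can be modeled with a short sketch. This is purely illustrative Python, not Bacula's actual C code; the names (Device, Job, JobMedia, record_files) are hypothetical and only model the logic described above:

```python
# Sketch of the JobMedia race: a Job snapshots its file locations, another
# thread switches Volumes, then the Job writes its JobMedia record.
from dataclasses import dataclass

@dataclass
class Device:
    media_id: int = 1          # the "global" current Volume index in the SD

@dataclass
class JobMedia:
    media_id: int
    first_index: int
    last_index: int

class Job:
    def __init__(self, dev):
        self.dev = dev
        self.saved_media_id = None   # the fix: snapshot the Volume Id too
        self.first_index = 1
        self.last_index = 0

    def record_files(self, last_index):
        # SD record keeping: save current file locations into a local copy,
        # and (the 2.2.3 fix) the Volume Id they were written on.
        self.last_index = last_index
        self.saved_media_id = self.dev.media_id

    def write_jobmedia_buggy(self):
        # Pre-2.2.3 behaviour: read the *current* global Volume index,
        # which may already point at the new Volume.
        return JobMedia(self.dev.media_id, self.first_index, self.last_index)

    def write_jobmedia_fixed(self):
        # 2.2.3 behaviour: use the Volume Id saved with the file locations.
        return JobMedia(self.saved_media_id, self.first_index, self.last_index)

dev = Device()
job = Job(dev)
job.record_files(last_index=4324)   # files actually written on Volume 1
dev.media_id = 2                    # another thread switches Volumes first
print(job.write_jobmedia_buggy().media_id)  # 2 -> wrong Volume in catalog
print(job.write_jobmedia_fixed().media_id)  # 1 -> correct
```

The snapshot of `media_id` taken alongside the file locations is exactly what makes the later, delayed JobMedia write safe.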
Perhaps with time we will have a better idea how to detect and correct this problem, but for the moment I can only show you some examples of failed and successful jobs and give some indications of how to "manually" recover the files. First the output showing a failure:

===================== Failure ==========================
10-Sep 08:21 rufus-dir: Start Backup JobId 1, Job=roxie.2007-09-10_08.21.21
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-sd: Wrote label to prelabeled Volume "TestVolume001" on device "FileStorage1" (/home/kern/bacula/xregress/tmp)
10-Sep 08:21 rufus-dir: Start Backup JobId 2, Job=lmatou.2007-09-10_08.21.22
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-dir: Start Backup JobId 3, Job=Multi-client.2007-09-10_08.21.23
10-Sep 08:21 rufus-dir: Using Device "FileStorage1"
10-Sep 08:21 rufus-sd: Job write elapsed time = 00:00:23, Transfer rate = 5.572 M bytes/second
10-Sep 08:21 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:21:47
  Build OS: i686-pc-linux-gnu redhat
  JobId: 1
  Job: roxie.2007-09-10_08.21.21
  Backup Level: Full
  Client: "roxie-fd" 2.2.1 (30Aug07) i686-pc-linux-gnu,suse,10.2
  FileSet: "roxie-FileSet" 2007-09-10 08:21:21
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 08:21:21
  Start time: 10-Sep-2007 08:21:22
  End time: 10-Sep-2007 08:21:47
  Elapsed time: 25 secs
  Priority: 10
  FD Files Written: 7,400
  SD Files Written: 7,400
  FD Bytes Written: 127,110,275 (127.1 MB)
  SD Bytes Written: 128,178,908 (128.1 MB)
  Rate: 5084.4 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001
  Volume Session Id: 1
  Volume Session Time: 1189405277
  Last Volume Bytes: 302,945,847 (302.9 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 08:21 rufus-dir: Begin pruning Jobs.
10-Sep 08:21 rufus-dir: No Jobs found to prune.
10-Sep 08:21 rufus-dir: Begin pruning Files.
10-Sep 08:21 rufus-dir: No Files found to prune.
10-Sep 08:21 rufus-dir: End auto prune.
10-Sep 08:22 rufus-sd: User defined maximum volume capacity 900,000,000 exceeded on device "FileStorage1" (/home/kern/bacula/xregress/tmp).
10-Sep 08:22 rufus-sd: End of medium on Volume "TestVolume001" Bytes=899,939,860 Blocks=13,950 at 10-Sep-2007 08:22.
10-Sep 08:22 rufus-sd: Wrote label to prelabeled Volume "TestVolume002" on device "FileStorage1" (/home/kern/bacula/xregress/tmp)
10-Sep 08:22 rufus-sd: New volume "TestVolume002" mounted on device "FileStorage1" (/home/kern/bacula/xregress/tmp) at 10-Sep-2007 08:22.
10-Sep 08:22 rufus-sd: Job write elapsed time = 00:01:20, Transfer rate = 7.982 M bytes/second
10-Sep 08:22 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:22:51
  Build OS: i686-pc-linux-gnu redhat
  JobId: 2
  Job: lmatou.2007-09-10_08.21.22
  Backup Level: Full
  Client: "lmatou-fd" 2.2.0 (08Aug07) i686-pc-linux-gnu,redhat,
  FileSet: "lmatou-FileSet" 2007-09-10 08:21:22
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 08:21:21
  Start time: 10-Sep-2007 08:21:23
  End time: 10-Sep-2007 08:22:51
  Elapsed time: 1 min 28 secs
  Priority: 10
  FD Files Written: 15,236
  SD Files Written: 15,236
  FD Bytes Written: 636,426,379 (636.4 MB)
  SD Bytes Written: 638,628,613 (638.6 MB)
  Rate: 7232.1 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001|TestVolume002
  Volume Session Id: 2
  Volume Session Time: 1189405277
  Last Volume Bytes: 222,311,478 (222.3 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 08:22 rufus-dir: Begin pruning Jobs.
10-Sep 08:22 rufus-dir: No Jobs found to prune.
10-Sep 08:22 rufus-dir: Begin pruning Files.
10-Sep 08:22 rufus-dir: No Files found to prune.
10-Sep 08:22 rufus-dir: End auto prune.
10-Sep 08:23 rufus-sd: Job write elapsed time = 00:01:33, Transfer rate = 4.786 M bytes/second
10-Sep 08:23 rufus-dir: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:23:02
  Build OS: i686-pc-linux-gnu redhat
  JobId: 3
  Job: Multi-client.2007-09-10_08.21.23
  Backup Level: Full
  Client: "rufus-fd" 2.2.2 (06Aug07) i686-pc-linux-gnu,redhat,
  FileSet: "Full Set" 2007-09-10 08:21:23
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 08:21:22
  Start time: 10-Sep-2007 08:21:27
  End time: 10-Sep-2007 08:23:02
  Elapsed time: 1 min 35 secs
  Priority: 10
  FD Files Written: 6,696
  SD Files Written: 6,696
  FD Bytes Written: 444,252,164 (444.2 MB)
  SD Bytes Written: 445,181,110 (445.1 MB)
  Rate: 4676.3 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume002
  Volume Session Id: 3
  Volume Session Time: 1189405277
  Last Volume Bytes: 313,858,029 (313.8 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 08:23 rufus-dir: Begin pruning Jobs.
10-Sep 08:23 rufus-dir: No Jobs found to prune.
10-Sep 08:23 rufus-dir: Begin pruning Files.
10-Sep 08:23 rufus-dir: No Files found to prune.
10-Sep 08:23 rufus-dir: End auto prune.
@#
@# now do a restore
@#
@tee tmp/log2.out
restore where=/home/kern/bacula/xregress/tmp/bacula-restores client=rufus-fd select storage=File1
Automatically selected FileSet: Full Set
+-------+-------+----------+-------------+---------------------+---------------+
| JobId | Level | JobFiles | JobBytes    | StartTime           | VolumeName    |
+-------+-------+----------+-------------+---------------------+---------------+
| 3     | F     | 6,696    | 444,252,164 | 2007-09-10 08:21:27 | TestVolume002 |
+-------+-------+----------+-------------+---------------------+---------------+
You have selected the following JobId: 3
Building directory tree for JobId 3 ...
+++++++++++++++++++++++++++++++++++++++++++++
1 Job, 6,078 files inserted into the tree.
You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.
cwd is: /
$ unmark *
6,696 files unmarked.
$ mark *
6,696 files marked.
$ done
Bootstrap records written to /home/kern/bacula/xregress/working/rufus-dir.restore.1.bsr
The job will require the following
   Volume(s)                 Storage(s)                SD Device(s)
===========================================================================
   TestVolume002             File1                     FileStorage1
6,696 files selected to be restored.
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /home/kern/bacula/xregress/working/rufus-dir.restore.1.bsr
Where:           /home/kern/bacula/xregress/tmp/bacula-restores
Replace:         always
FileSet:         Full Set
Backup Client:   rufus-fd
Restore Client:  rufus-fd
Storage:         File1
When:            2007-09-10 08:23:03
Catalog:         MyCatalog
Priority:        10
OK to run? (yes/mod/no): yes
Job queued. JobId=4
wait
You have messages.
messages
10-Sep 08:23 rufus-dir: Start Restore Job RestoreFiles.2007-09-10_08.23.03
10-Sep 08:23 rufus-dir: Using Device "FileStorage1"
10-Sep 08:23 rufus-sd: Ready to read from volume "TestVolume002" on device "FileStorage1" (/home/kern/bacula/xregress/tmp).
10-Sep 08:23 rufus-sd: Forward spacing Volume "TestVolume002" to file:block 0:64724.
10-Sep 08:23 rufus-fd: RestoreFiles.2007-09-10_08.23.03 Warning: Cannot change owner and/or group of /home/kern/bacula/xregress/tmp/bacula-restores/home/kern/bacula/k/src/win32/stored/: ERR=Operation not permitted
10-Sep 08:23 rufus-fd: RestoreFiles.2007-09-10_08.23.03 Error: attribs.c:442 Unable to set file owner /home/kern/bacula/xregress/tmp/bacula-restores/home/kern/bacula/k/: ERR=Operation not permitted
[ ... snip ...
]
10-Sep 08:23 rufus-dir: RestoreFiles.2007-09-10_08.23.03 Error: Bacula rufus-dir 2.2.2 (06Aug07): 10-Sep-2007 08:23:23
  Build OS: i686-pc-linux-gnu redhat
  JobId: 4
  Job: RestoreFiles.2007-09-10_08.23.03
  Restore Client: rufus-fd
  Start time: 10-Sep-2007 08:23:05
  End time: 10-Sep-2007 08:23:23
  Files Expected: 6,696
  Files Restored: 2,525
  Bytes Restored: 188,395,392
  Rate: 10466.4 KB/s
  FD Errors: 240
  FD termination status: Error
  SD termination status: OK
  Termination: *** Restore Error ***
10-Sep 08:23 rufus-dir: Begin pruning Jobs.
10-Sep 08:23 rufus-dir: No Jobs found to prune.
10-Sep 08:23 rufus-dir: Begin pruning Files.
10-Sep 08:23 rufus-dir: No Files found to prune.
10-Sep 08:23 rufus-dir: End auto prune.
!!!!! multi-client-test failed!!! !!!!!
!!!!! Bad Job termination status  !!!!!
bstat=0 rstat=1 dstat=0
================= end failure example ==================

Now what happened above is that there were three backup jobs run. JobId 3, Multi-client.2007-09-10_08.21.23, shows a successful backup to TestVolume002. However, it *actually* wrote to TestVolume001 and TestVolume002. When the restore is done, only TestVolume002 appears in the output, and only the files that were on TestVolume002 were restored. This is because *all* the JobMedia records indicate they were written on TestVolume002.
Now the output from the same test that succeeds (after the patch is applied) looks like:

==================== begin successful backup/restore =============
10-Sep 12:37 rufus-dir: Start Backup JobId 1, Job=roxie.2007-09-10_12.37.07
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-sd: Wrote label to prelabeled Volume "TestVolume001" on device "FileStorage1" (/home/kern/bacula/regress/tmp)
10-Sep 12:37 rufus-dir: Start Backup JobId 2, Job=lmatou.2007-09-10_12.37.08
10-Sep 12:37 rufus-dir: Start Backup JobId 3, Job=Multi-client.2007-09-10_12.37.09
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-dir: Using Device "FileStorage1"
10-Sep 12:37 rufus-sd: Job write elapsed time = 00:00:25, Transfer rate = 5.127 M bytes/second
10-Sep 12:37 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:37:35
  Build OS: i686-pc-linux-gnu redhat
  JobId: 1
  Job: roxie.2007-09-10_12.37.07
  Backup Level: Full
  Client: "roxie-fd" 2.2.1 (30Aug07) i686-pc-linux-gnu,suse,10.2
  FileSet: "roxie-FileSet" 2007-09-10 12:37:07
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 12:37:06
  Start time: 10-Sep-2007 12:37:08
  End time: 10-Sep-2007 12:37:35
  Elapsed time: 27 secs
  Priority: 10
  FD Files Written: 7,400
  SD Files Written: 7,400
  FD Bytes Written: 127,110,275 (127.1 MB)
  SD Bytes Written: 128,178,908 (128.1 MB)
  Rate: 4707.8 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001
  Volume Session Id: 1
  Volume Session Time: 1189420623
  Last Volume Bytes: 363,458,103 (363.4 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 12:37 rufus-dir: Begin pruning Jobs.
10-Sep 12:37 rufus-dir: No Jobs found to prune.
10-Sep 12:37 rufus-dir: Begin pruning Files.
10-Sep 12:37 rufus-dir: No Files found to prune.
10-Sep 12:37 rufus-dir: End auto prune.
10-Sep 12:38 rufus-sd: User defined maximum volume capacity 900,000,000 exceeded on device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:38 rufus-sd: End of medium on Volume "TestVolume001" Bytes=899,939,867 Blocks=13,950 at 10-Sep-2007 12:38.
10-Sep 12:38 rufus-sd: Wrote label to prelabeled Volume "TestVolume002" on device "FileStorage1" (/home/kern/bacula/regress/tmp)
10-Sep 12:38 rufus-sd: New volume "TestVolume002" mounted on device "FileStorage1" (/home/kern/bacula/regress/tmp) at 10-Sep-2007 12:38.
10-Sep 12:38 rufus-sd: Job write elapsed time = 00:01:24, Transfer rate = 7.602 M bytes/second
10-Sep 12:38 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:38:40
  Build OS: i686-pc-linux-gnu redhat
  JobId: 2
  Job: lmatou.2007-09-10_12.37.08
  Backup Level: Full
  Client: "lmatou-fd" 2.2.0 (08Aug07) i686-pc-linux-gnu,redhat,
  FileSet: "lmatou-FileSet" 2007-09-10 12:37:08
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 12:37:07
  Start time: 10-Sep-2007 12:37:09
  End time: 10-Sep-2007 12:38:40
  Elapsed time: 1 min 31 secs
  Priority: 10
  FD Files Written: 15,236
  SD Files Written: 15,236
  FD Bytes Written: 636,426,379 (636.4 MB)
  SD Bytes Written: 638,628,613 (638.6 MB)
  Rate: 6993.7 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001|TestVolume002
  Volume Session Id: 2
  Volume Session Time: 1189420623
  Last Volume Bytes: 274,050,095 (274.0 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 12:38 rufus-dir: Begin pruning Jobs.
10-Sep 12:38 rufus-dir: No Jobs found to prune.
10-Sep 12:38 rufus-dir: Begin pruning Files.
10-Sep 12:38 rufus-dir: No Files found to prune.
10-Sep 12:38 rufus-dir: End auto prune.
10-Sep 12:38 rufus-sd: Job write elapsed time = 00:01:38, Transfer rate = 4.542 M bytes/second
10-Sep 12:38 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:38:48
  Build OS: i686-pc-linux-gnu redhat
  JobId: 3
  Job: Multi-client.2007-09-10_12.37.09
  Backup Level: Full
  Client: "rufus-fd" 2.3.3 (06Sep07) i686-pc-linux-gnu,redhat,
  FileSet: "Full Set" 2007-09-10 12:37:09
  Pool: "Default" (From Job resource)
  Storage: "File1" (From command line)
  Scheduled time: 10-Sep-2007 12:37:08
  Start time: 10-Sep-2007 12:37:09
  End time: 10-Sep-2007 12:38:48
  Elapsed time: 1 min 39 secs
  Priority: 10
  FD Files Written: 6,696
  SD Files Written: 6,696
  FD Bytes Written: 444,252,164 (444.2 MB)
  SD Bytes Written: 445,181,110 (445.1 MB)
  Rate: 4487.4 KB/s
  Software Compression: None
  VSS: no
  Encryption: no
  Volume name(s): TestVolume001|TestVolume002
  Volume Session Id: 3
  Volume Session Time: 1189420623
  Last Volume Bytes: 313,858,022 (313.8 MB)
  Non-fatal FD errors: 0
  SD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Backup OK
10-Sep 12:38 rufus-dir: Begin pruning Jobs.
10-Sep 12:38 rufus-dir: No Jobs found to prune.
10-Sep 12:38 rufus-dir: Begin pruning Files.
10-Sep 12:38 rufus-dir: No Files found to prune.
10-Sep 12:38 rufus-dir: End auto prune.
@#
@# now do a restore
@#
@tee tmp/log2.out
restore where=/home/kern/bacula/regress/tmp/bacula-restores client=rufus-fd select storage=File1
Automatically selected FileSet: Full Set
+-------+-------+----------+-------------+---------------------+---------------+
| JobId | Level | JobFiles | JobBytes    | StartTime           | VolumeName    |
+-------+-------+----------+-------------+---------------------+---------------+
| 3     | F     | 6,696    | 444,252,164 | 2007-09-10 12:37:09 | TestVolume001 |
| 3     | F     | 6,696    | 444,252,164 | 2007-09-10 12:37:09 | TestVolume002 |
+-------+-------+----------+-------------+---------------------+---------------+
You have selected the following JobId: 3
Building directory tree for JobId 3 ...
+++++++++++++++++++++++++++++++++++++++++++++
1 Job, 6,078 files inserted into the tree.
You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.
cwd is: /
$ unmark *
6,696 files unmarked.
$ mark *
6,696 files marked.
$ done
Bootstrap records written to /home/kern/bacula/regress/working/rufus-dir.restore.1.bsr
The job will require the following
   Volume(s)                 Storage(s)                SD Device(s)
===========================================================================
   TestVolume001             File1                     FileStorage1
   TestVolume002             File1                     FileStorage1
6,696 files selected to be restored.
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /home/kern/bacula/regress/working/rufus-dir.restore.1.bsr
Where:           /home/kern/bacula/regress/tmp/bacula-restores
Replace:         always
FileSet:         Full Set
Backup Client:   rufus-fd
Restore Client:  rufus-fd
Storage:         File1
When:            2007-09-10 12:38:49
Catalog:         MyCatalog
Priority:        10
OK to run? (yes/mod/no): yes
Job queued. JobId=4
wait
You have messages.
messages
10-Sep 12:38 rufus-dir: Start Restore Job RestoreFiles.2007-09-10_12.38.49
10-Sep 12:38 rufus-dir: Using Device "FileStorage1"
10-Sep 12:38 rufus-sd: Ready to read from volume "TestVolume001" on device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:38 rufus-sd: Forward spacing Volume "TestVolume001" to file:block 0:10644692.
10-Sep 12:39 rufus-sd: End of file 0 on device "FileStorage1" (/home/kern/bacula/regress/tmp), Volume "TestVolume001"
10-Sep 12:39 rufus-sd: End of Volume at file 0 on device "FileStorage1" (/home/kern/bacula/regress/tmp), Volume "TestVolume001"
10-Sep 12:39 rufus-sd: Ready to read from volume "TestVolume002" on device "FileStorage1" (/home/kern/bacula/regress/tmp).
10-Sep 12:39 rufus-sd: Forward spacing Volume "TestVolume002" to file:block 0:212.
10-Sep 12:39 rufus-sd: End of file 0 on device "FileStorage1" (/home/kern/bacula/regress/tmp), Volume "TestVolume002"
10-Sep 12:39 rufus-sd: End of Volume at file 0 on device "FileStorage1" (/home/kern/bacula/regress/tmp), Volume "TestVolume002"
10-Sep 12:39 rufus-sd: End of all volumes.
10-Sep 12:39 rufus-dir: Bacula rufus-dir 2.3.3 (06Sep07): 10-Sep-2007 12:39:52
  Build OS: i686-pc-linux-gnu redhat
  JobId: 4
  Job: RestoreFiles.2007-09-10_12.38.49
  Restore Client: rufus-fd
  Start time: 10-Sep-2007 12:38:51
  End time: 10-Sep-2007 12:39:52
  Files Expected: 6,696
  Files Restored: 6,696
  Bytes Restored: 444,252,164
  Rate: 7282.8 KB/s
  FD Errors: 0
  FD termination status: OK
  SD termination status: OK
  Termination: Restore OK
10-Sep 12:39 rufus-dir: Begin pruning Jobs.
10-Sep 12:39 rufus-dir: No Jobs found to prune.
10-Sep 12:39 rufus-dir: Begin pruning Files.
10-Sep 12:39 rufus-dir: No Files found to prune.
10-Sep 12:39 rufus-dir: End auto prune.
==================== end successful backup/restore ==========

Now if we dump the JobMedia records for the Job that failed to restore properly, we get:

Enter SQL query: select * from JobMedia where JobId=3;
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| JobMediaId | JobId | MediaId | FirstIndex | LastIndex | StartFile | EndFile | StartBlock | EndBlock    | VolIndex | Copy | Stripe |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| 2          | 3     | 2       | 1          | 4,324     | 0         | 0       | 57,867,465 | 899,939,859 | 1        | 0    | 0      |
| 5          | 3     | 2       | 4,324      | 6,696     | 0         | 0       | 212        | 313,858,028 | 2        | 0    | 0      |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+

The key is to note that the MediaId is 2 for both of the records. That is perfectly legal, but in this case the first record was actually written on MediaId=1.
It was just incorrectly recorded. Below is an example from the same test that succeeded. Note that the first record has MediaId=1 and the second has MediaId=2. Some of the details of the start and end block numbers are different, but that just shows that the timing was different.

Enter SQL query: select * from JobMedia where JobId=3;
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| JobMediaId | JobId | MediaId | FirstIndex | LastIndex | StartFile | EndFile | StartBlock | EndBlock    | VolIndex | Copy | Stripe |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+
| 3          | 3     | 1       | 1          | 4,198     | 0         | 0       | 46,771,401 | 899,230,227 | 1        | 0    | 0      |
| 5          | 3     | 2       | 4,198      | 6,696     | 0         | 0       | 64,724     | 313,858,028 | 2        | 0    | 0      |
+------------+-------+---------+------------+-----------+-----------+---------+------------+-------------+----------+------+--------+

One aspect of this is to notice that if you sort the JobMedia records by JobMediaId, as they are in the listing, then for any given MediaId the records *must* have increasing StartBlock and EndBlock values, and they must not overlap. In the case of the successful Job, the MediaIds are different, so we cannot compare Start and End Block numbers; but in the case of the failed Job, we see that there are two JobMedia records for the same MediaId (perfectly OK), *but* the first one has a larger StartBlock than the second record, which is not possible.

Now obviously the simplest thing to do is to detect this condition and correct the MediaId. However, this is not always so easy, because the Job output does not *clearly* show which Volumes were written for which Job, though it can be deduced from the full output by knowing when Jobs start and stop and looking at the times when Bacula changes Volumes.
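The block-ordering invariant above can be checked mechanically. Here is a hedged sketch in Python that applies it to the columns shown in the listings; it uses an in-memory SQLite copy of the relevant JobMedia columns purely for illustration (against a real Bacula catalog you would query MySQL or PostgreSQL instead, and the helper name `find_suspect_jobs` is my own, not part of Bacula):

```python
import sqlite3

def find_suspect_jobs(rows):
    """rows: (JobMediaId, JobId, MediaId, StartBlock, EndBlock) tuples.
    Walks the records in JobMediaId order; for any given MediaId the
    StartBlock/EndBlock ranges must increase without overlapping, so a
    record that starts at or before the previous end marks a suspect Job."""
    suspects = set()
    last_end = {}  # (JobId, MediaId) -> EndBlock of the previous record
    for jmid, jobid, mediaid, sb, eb in sorted(rows):
        key = (jobid, mediaid)
        if key in last_end and sb <= last_end[key]:
            suspects.add(jobid)
        last_end[key] = eb
    return suspects

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE JobMedia (JobMediaId, JobId, MediaId,"
             " StartBlock, EndBlock)")
# The two records from the failed Job above: same MediaId, but the second
# record starts *before* the first one ends -- impossible on one Volume.
conn.executemany("INSERT INTO JobMedia VALUES (?,?,?,?,?)",
                 [(2, 3, 2, 57867465, 899939859),
                  (5, 3, 2, 212, 313858028)])
rows = conn.execute("SELECT * FROM JobMedia ORDER BY JobMediaId").fetchall()
print(find_suspect_jobs(rows))  # {3}
```

Run against the successful Job's records (MediaId 1 then 2), the same check returns an empty set, since records on different MediaIds are never compared.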
In addition, there may be a very large number of JobMedia records if it is a big Job.

Now for a look at the bsr files:

======= Unsuccessful bsr file ==================
# 10-Sep-2007 12:47:51 - Multi-client.2007-09-10_12.46.29 - Full
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189421182
VolFile=0-0
VolBlock=57867465-899939866
FileIndex=1-4337
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189421182
VolFile=0-0
VolBlock=64724-313858021
FileIndex=4337-6696
================================================

======== Successful bsr file ================
# 10-Sep-2007 12:38:48 - Multi-client.2007-09-10_12.37.09 - Full
Volume="TestVolume001"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189420623
VolFile=0-0
VolBlock=10644692-899939866
FileIndex=1-4344
Volume="TestVolume002"
MediaType="File1"
VolSessionId=3
VolSessionTime=1189420623
VolFile=0-0
VolBlock=212-313858021
FileIndex=4344-6696
============================================

What is key in looking at these two files is that the first, incorrect restore lists TestVolume002 twice, whereas the second, successful restore lists TestVolume001 and TestVolume002. So, if you understand the problem, you can also patch things up for the restore by manually editing the .bsr file when you get to the yes/mod/no prompt, before you answer yes. For example, in the unsuccessful restore bsr file shown above, if I modify the second line of the file to read:

Volume="TestVolume001"

then run the restore, it will work. Obviously this is much more complicated if there are a lot of Volumes involved ...

===

Another way of correcting the problem may be to run bscan. However, there are a number of aspects of bscanning that I don't like:

1. If you use the old bscan, it may fail the same way the original backup failed to write correct JobMedia records (I haven't tested it).
2. Bscanning is slow.
3.
Bscan does not restore *all* the info in the catalog, so you will be left with a catalog that is so-so.
4. Bscanning with a 2.2.3 version may work, but as of this moment it is untested.

If you want to go the bscan route, I recommend saving a copy of your database (just in case), but then instead of purging your database and recreating it with bscan, I would simply create a different database and, if that works, use it for restores. Finally, if bscan does work, it may be possible to purge just the Job that fails to restore correctly, then bscan only that Job (probably using a .bsr file). This would minimize the loss of information stored in your database. This has not been tried yet.

===

Before investigating bscan in more detail, I will release 2.2.3 and work on getting a patch for 2.0.x systems.

This document can be found at www.bacula.org/downloads/bug-395.txt and will be updated as the situation progresses.

Regards,

Kern