Hi Kern,

First of all, thank you for your efforts in confirming and fixing this nasty 
bug! Now at least I know it was not something I did wrong in my configuration 
(I have struggled with multiple simultaneous jobs spanning several volumes 
for quite some time, as you can see from my earlier posts).

Since this is obviously a major bug, I hope you won't mind providing a few 
clarifications as soon as your time permits. Please see my questions below. 
Of course, releasing the fix as quickly as possible is most important, so I 
will be patient.

On Sunday 09 September 2007 05:46:24 pm Kern Sibbald wrote:
> Hello,
>
> I regret to have to announce that there is a rather serious bug in Bacula.
>
> Bacula bug #935 reports that during a restore, a large number of files are
> missing and thus not restored.  This is really quite surprising because we
> have a fairly extensive regression test suite that explicitly tests for
> this kind of problem many times.
>
> Despite our testing, there is indeed a bug in Bacula that has the following
> characteristics:
>
> 1. It happens only when multiple simultaneous Jobs are run (regardless of
> whether or not data spooling is enabled).

Does this mean that the bug only shows up if you actually *run* simultaneous 
jobs, or also if you merely have them enabled in the config? After struggling 
to find a solution for this problem myself, I finally worked around it by 
fine-tuning job priorities so that only small jobs (guaranteed to fit on a 
single volume) are allowed to run concurrently, while each large job has its 
own distinct, higher priority level, so only one of them can run at a time. I 
suppose I should be safe with that configuration? I would hate to have to 
discard all of my backups; I have just finished backing up about 8TB of 
data...
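
To make the question concrete, the priority scheme I describe can be sketched 
roughly as below. This is an illustrative fragment only, not my actual 
configuration: the job names and priority numbers are made up, and the other 
directives a Job resource needs are left out. It assumes, as far as I 
understand the scheduler, that jobs sharing a priority may run together while 
jobs with different priorities wait for one another.

```
# Illustrative sketch only: job names and priority values are hypothetical.
# Other required Job directives (Client, FileSet, Schedule, Storage, Pool,
# Messages) are omitted for brevity.

# Small jobs, each expected to fit on a single volume, share one priority
# so they are allowed to run concurrently.
Job {
  Name = "small-job-a"
  Priority = 10
}
Job {
  Name = "small-job-b"
  Priority = 10
}

# Large, multi-volume jobs each get a distinct priority, so only one of
# them should be running at any given time.
Job {
  Name = "large-job-a"
  Priority = 11
}
Job {
  Name = "large-job-b"
  Priority = 12
}
```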

> 2. It has only been observed on disk based backup, but not on tape.

I am using tapes.

> 3. Under the right circumstances (timing), it could and probably does
> happen on tape backups.

What about large multi-volume backups run exclusively? Is there any chance 
they are affected too? I can test-restore some of my latest larger ones, but 
that is a lot of work... The last time I did a large restore (3 volumes, 
800,000+ files, about 1TB of data) was 2 months ago; I was running 1.38.11 on 
the server at that time, and the backup had been created with 1.36.2. There 
were no problems restoring.

> 4. It seems to be timing dependent, and requires multiple clients to
> reproduce.
>
> 5. Analysis indicates that it happens most often when the clients are slow
> (e.g. doing Incremental backups).
>
> 6. It has been verified to exist in versions 2.0.x and 2.2.x.
>
> 7. It should also be in version 1.38, but could not be reproduced in
> testing, perhaps due to timing considerations or the fact that the test FD
> daemons were version 2.2.2.
>
> 8. The data is correctly stored on the Volume, but incorrect index
> (JobMedia) records are stored in the database.  (the JobMedia record
> generated during the Volume change contains the index of the new Volume
> rather than the previous Volume).
>
> 9. You can prevent the problem from occurring by either turning off
> multiple simultaneous Jobs or by ensuring that while running multiple
> simultaneous Jobs that those Jobs do not span Volumes.  E.g. you could
> manually mark Volumes as full when they are sufficiently large.
>
> 10. If you are not running multiple simultaneous Jobs, you will not be
> affected by this bug.
>
> 11. If you are running multiple simultaneous Jobs to tapes, I believe there
> is a reasonable probability that this problem could show up when Jobs are
> split across tapes.
>
> 12. If you are running multiple simultaneous Jobs to disks, I believe there
> is a high probability that this problem will show up when Jobs are split
> across disks Volumes.
>
> I have uploaded patches to bug #935 (bugs.bacula.org) that will correct
> version 2.2.0, 2.2.1, and 2.2.2.  The patch has been tested only on version
> 2.2.2 and passes all regression tests as well as the specific test that
> reproduced the problem.
>
> After a little more testing, I plan to release version 2.2.3 probably on
> Monday the 10th or Tuesday.
>
> At this time, I do not have a patch for 2.0.x versions, and unless there is
> some really compelling reason to create one, I would prefer not -- it would
> not be a huge effort to back port the patch, but it would require rather
> extensive testing.  Though it is hard to make a specific recommendation, I
> believe that it probably will be the wisest and simplest to either patch
> version 2.2.x if that is what you are currently running, or upgrade to
> version 2.2.3 when it is released.
>
> It *could* be possible to manually correct the bad JobMedia records in the
> catalog, but it is not something that I would personally recommend.  If you
> *really* need data off an old tape, I recommend first trying a restore.
> Sometime tomorrow, I will provide more detailed instructions on several
> ways how to correct the problem if necessary -- all of them are somewhat
> painful.

--Ivan
