Sorry if this is not the right place to post this; I tried to submit
it to Mantis, but it never sent the confirmation email needed to
set up my account.

We recently upgraded to the latest community version of Bacula, and
since then some of our volumes have been landing in an error state on a
daily basis.

The following errors occur:

2014-01-03 10Director JobId 1075931: Using Device "DD7" to write.
2014-01-03 10StorageDaemon JobId 1075945: 3305 Autochanger "load slot
2694, drive 0", status is OK.
2014-01-03 10Director JobId 1075947: Sending Accurate information.
2014-01-03 10StorageDaemon JobId 1075945: Warning: vol_mgr.c:464 Need
volume from other drive, but swap not possible. Status: read=0
num_writers=0 num_reserve=1 swap=0 vol=FilersD-02694 from dev="DD7"
(/srv/bacula/backup/0/staging-disk-diff/drive7) to "DD0"
(/srv/bacula/backup/0/staging-disk-diff/drive0)
2014-01-03 10StorageDaemon JobId 1075945: Warning: Volume
"FilersD-02694" not on device "DD0"
(/srv/bacula/backup/0/staging-disk-diff/drive0).
2014-01-03 10StorageDaemon JobId 1075945: Marking Volume
"FilersD-02694" in Error in Catalog.

Further investigation demonstrated that the volume is definitely meant
to be used for jobid 1075945.

This issue occurs under the following conditions:

  * An autochanger is in use.
  * There are many requests from the storage daemon for the director
to send the best 20 volumes to try.
  * The volume lock is contended.

When a changer is in use, a device is selected from the changer and a
volume is selected for it.
The device and volume are inserted safely into the volume list.
It is then the duty of the autochanger to load the volume from
the appropriate slot into the device. If the device already contains a
volume, that volume is unloaded and we try to free the volume we are
unloading.

However, what this really removes from the volume list is the volume
we have just acquired and are about to load. Since the volume list is
contended, another thread will iterate over it to find a suitable
volume. Sometimes it will select the volume that was just released,
which the original job should really keep using, and (possibly) load
it into the device that it has acquired.

The originating thread still expects its original volume to be
available to it. When it discovers later in the code that its volume is
no longer sitting on its device, a swap is attempted, but since another
job has now acquired the volume, the swap fails and the volume is
marked in error.

The problem here is that we assume the slot the autochanger has
loaded matches the slot of the requested volume, but since we have
not yet loaded that volume, this is a dangerous assumption.
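
To make the interleaving concrete, here is a minimal single-threaded
sketch in C that replays the steps in order. It is not Bacula code: the
vol_list structure and the helpers reserve_vol(), free_vol_for_dev() and
owner_of() are hypothetical stand-ins for the real volume list, and the
device numbers 0 and 7 simply mirror "DD0" and "DD7" from the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified model of the volume list (not Bacula's types). */
#define MAX_VOLS 8

struct vol_entry {
    char name[32];
    int  owner_dev;               /* device index holding the reservation */
};

struct vol_list {
    struct vol_entry v[MAX_VOLS];
    size_t n;                     /* number of reserved volumes */
};

static int find_vol(const struct vol_list *l, const char *name)
{
    for (size_t i = 0; i < l->n; i++) {
        if (strcmp(l->v[i].name, name) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Reserve a volume for a device; fails if another device already holds it. */
static bool reserve_vol(struct vol_list *l, const char *name, int dev)
{
    int i = find_vol(l, name);
    if (i >= 0) {
        return l->v[i].owner_dev == dev;
    }
    assert(l->n < MAX_VOLS);
    snprintf(l->v[l->n].name, sizeof l->v[l->n].name, "%s", name);
    l->v[l->n].owner_dev = dev;
    l->n++;
    return true;
}

/* Models the unconditional free on unload: drops whatever the device
 * currently has reserved, regardless of what was physically unloaded. */
static void free_vol_for_dev(struct vol_list *l, int dev)
{
    for (size_t i = 0; i < l->n; i++) {
        if (l->v[i].owner_dev == dev) {
            l->v[i] = l->v[l->n - 1];
            l->n--;
            return;
        }
    }
}

/* Returns the owning device index, or -1 if the volume is not reserved. */
static int owner_of(const struct vol_list *l, const char *name)
{
    int i = find_vol(l, name);
    return i >= 0 ? l->v[i].owner_dev : -1;
}

/* Replays the sequence; returns true if the first job lost its volume. */
static bool replay_race(void)
{
    struct vol_list l = { .n = 0 };
    const char *vol = "FilersD-02694";

    /* Job A (device 0) reserves the volume before the changer loads it. */
    reserve_vol(&l, vol, 0);
    /* Device 0 still holds an older volume; unloading it frees the entry
     * for device 0, which is the reservation job A has just made. */
    free_vol_for_dev(&l, 0);
    /* Job B (device 7) scans the list, finds the volume free, takes it. */
    bool b_got_it = reserve_vol(&l, vol, 7);
    /* Job A now finds that its volume belongs to another device. */
    return b_got_it && owner_of(&l, vol) != 0;
}
```

In this model replay_race() returns true: after the unload step the
reservation for device 0 is gone, and device 7 ends up owning the volume
that job A still expects to use.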

This situation arises very often when bacula-sd is stopped and
started. The autochanger script may maintain its own independent state
of which slots are loaded into which drives, for which bacula-sd no
longer has any state. Thus, on startup, for many of the devices in the
autochanger script's state, bacula-sd has no historical knowledge of
how they came to be loaded.

Note that performing an incorrect swap is only one outcome of this
problem (and the most obvious one); the outcome depends on what is
racing, who wins the race, and what the race winner intends to do.

I have provided a small patch, which changes the autochanger
behaviour. It will only free a volume on unload when the volume being
unloaded in the autochanger matches the volume bacula expects to be in
the autochanger.

The patch also covers the zeroth slot, since the code already makes a
check for that further up.

diff -ur bacula-5.2.13/src/stored/autochanger.c bacula-5.2.13-new/src/stored/autochanger.c
--- bacula-5.2.13/src/stored/autochanger.c 2013-02-19 19:21:35.000000000 +0000
+++ bacula-5.2.13-new/src/stored/autochanger.c 2014-01-03 13:36:04.380454536 +0000
@@ -397,8 +397,11 @@
    }
    unlock_changer(dcr);

-   if (loaded > 0) {           /* free_volume outside from changer lock */
-      free_volume(dev);        /* Free any volume associated with this drive */
+   if (dev->get_slot() == loaded) {
+      /* free_volume outside from changer lock */
+      /* avoid freeing volume when the autochanger slot differs */
+      /* from the running vol list. */
+      free_volume(dev);
    }

    if (ok) {
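
For comparison, the two conditions in the hunk above can be written as
standalone predicates. This is only a sketch: should_free_old() and
should_free_new() are hypothetical names, dev_slot stands for the value
of dev->get_slot(), and slot 17 is an invented example of a stale slot
left in the drive.

```c
#include <assert.h>
#include <stdbool.h>

/* Old condition: free whenever the changer reported any loaded slot. */
static bool should_free_old(int loaded)
{
    return loaded > 0;
}

/* Patched condition: free only when the slot recorded for the device
 * equals the slot the changer reported as loaded. */
static bool should_free_new(int dev_slot, int loaded)
{
    return dev_slot == loaded;
}

/* Illustrative cases; slot 17 is an assumed value, 2694 is from the log. */
static void check_cases(void)
{
    /* Device expects slot 2694, but the drive still holds slot 17:
     * the old test frees the new reservation, the patched one does not. */
    assert(should_free_old(17));
    assert(!should_free_new(2694, 17));

    /* Recorded slot and loaded slot agree: both conditions free. */
    assert(should_free_old(17));
    assert(should_free_new(17, 17));

    /* Both values are 0: the old test skips the free, the new one runs it. */
    assert(!should_free_old(0));
    assert(should_free_new(0, 0));
}
```

The first case is the one that matters for the race: the recorded slot
and the loaded slot disagree, so the patched condition leaves the
reservation in the volume list.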

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
