Hello Wolfgang,

As I have mentioned before, I believe that this is an OS bug.  I have been 
able to reproduce it here, and I believe it happens when you send too much 
data too fast to the OS.  Please see below for more comments:

On Friday 18 August 2006 18:22, Wolfgang Denk wrote:
> Dear Kern,
> 
> in message <[EMAIL PROTECTED]> you wrote:
> > 
> > Thanks for the feedback. I'm happy to see that your problem is resolved. 
> 
> I start seeing this problem now, too, especially when running several
> jobs (with data spooling) in parallel, and when the tape gets full.
> 
> > I guess that the next time I respond, I should not only mention the fact 
that 
> > Bacula tends to expose a lot of driver bugs AND hardware problems, both of 
> > which are difficult and time consuming to debug ...
> 
> I don't think it's either of these in my case. When I run  into  this
> situation,  I  can  reliably  reproduce the problem (but I don't know
> which exact conditions trigger it in the first place). Here  is  what
> happens:
> 
> 17-Aug 00:40 nyx-sd: Spooling data ...
> 17-Aug 00:40 nyx-sd: Committing spooled data to Volume "K-V-DAT-3". 
Despooling 1,985,286 bytes ...
> 17-Aug 00:40 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: block.c:538 
Write error at 22:2740 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.
> 17-Aug 00:40 nyx-sd: Re-read of last block succeeded.
> 17-Aug 00:40 nyx-sd: End of medium on Volume "K-V-DAT-3" 
Bytes=15,118,094,426 Blocks=234,370 at 17-Aug-2006 00:40.
> 17-Aug 00:41 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 17-Aug 00:41 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 3.
> 17-Aug 00:41 nyx-sd: 3307 Issuing autochanger "unload slot 3, drive 0" 
command.
> 17-Aug 00:42 nyx-sd: 3304 Issuing autochanger "load slot 4, drive 0" 
command.
> 17-Aug 00:43 nyx-sd: 3305 Autochanger "load slot 4, drive 0", status is OK.
> 17-Aug 00:43 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 17-Aug 00:43 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 17-Aug 00:43 nyx-sd: Wrote label to prelabeled Volume "K-V-DAT-4" on device 
"DDS-3" (/dev/nst0)
> 17-Aug 00:43 nyx-sd: New volume "K-V-DAT-4" mounted on device 
"DDS-3" (/dev/nst0) at 17-Aug-2006 00:43.
> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: block.c:538 
Write error at 0:1 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.

The above error "Device or resource" should never happen, from what I know 
about Linux.  The write() system call should be a blocking call and it should 
never return busy, thus I consider this is most likely an operating system 
bug.  The proper behavior is for the OS to simply wait until the I/O is 
complete and then return.  

At one point, I was able to reproduce this problem here, and in 1.39.x I have 
what I consider a really crude workaround.  Basically Bacula will retry the 
I/O after waiting a bit.  Using this technique, I was able to eliminate the 
problem here, but I am not really satisfied with the fix as the amount of 
time that Bacula needs to wait is indeterminant.  

As far as I can tell, this problem occurs because Bacula is feeding data to 
the tape drive too fast.  I can only guess that the OS memory is getting 
completely full and at some point rather than making Bacula wait as it 
should, the OS returns an error code.  After that point, everything for that 
particular job is more or less screwed up as you can see in the output that 
follows, because Bacula gets an error during the rewind since the OS is 
probably still busy.

> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: Re-read of last 
block OK, but block numbers differ. Last block=0 Current block=0.
> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Fatal error: 
spool.c:249 Fatal append error on device "DDS-3" (/dev/nst0): ERR=dev.c:678 
Rewind error on "DDS-3" (/dev/nst0). ERR=Input/output error.
> 
> ...
> 
> 18-Aug 00:40 nyx-sd: Committing spooled data to Volume "K-V-DAT-4". 
Despooling 948,750,016 bytes ...
> 18-Aug 00:40 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: block.c:538 
Write error at 2:6438 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.
> 18-Aug 00:41 nyx-sd: Re-read of last block succeeded.
> 18-Aug 00:41 nyx-sd: End of medium on Volume "K-V-DAT-4" Bytes=691,688,327 
Blocks=10,727 at 18-Aug-2006 00:41.
> 18-Aug 00:41 castor-dir: Recycled volume "K-V-DAT-1"
> 18-Aug 00:41 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:41 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 18-Aug 00:41 nyx-sd: 3307 Issuing autochanger "unload slot 4, drive 0" 
command.
> 18-Aug 00:41 nyx-sd: 3304 Issuing autochanger "load slot 1, drive 0" 
command.
> 18-Aug 00:42 nyx-sd: 3305 Autochanger "load slot 1, drive 0", status is OK.
> 18-Aug 00:42 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:42 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 00:42 nyx-sd: Recycled volume "K-V-DAT-1" on device 
"DDS-3" (/dev/nst0), all previous data lost.
> 18-Aug 00:42 nyx-sd: New volume "K-V-DAT-1" mounted on device 
"DDS-3" (/dev/nst0) at 18-Aug-2006 00:42.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: block.c:538 
Write error at 0:1 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: Re-read of 
last block OK, but block numbers differ. Last block=0 Current block=0.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Fatal error: 
spool.c:249 Fatal append error on device "DDS-3" (/dev/nst0): ERR=dev.c:678 
Rewind error on "DDS-3" (/dev/nst0). ERR=Input/output error.
> ...
> 18-Aug 00:42 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:42 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 00:42 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.
> 18-Aug 00:43 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace 
record at EOT failed. ERR=Input/output error
> 18-Aug 00:43 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error 
on "DDS-3" (/dev/nst0). ERR=Input/output error.
> 
> 18-Aug 00:43 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 
> 
> In  this  state,  the  newly   mounted   volumes   ("K-V-DAT-4"   and
> "K-V-DAT-1")  are  empty,  i.  e.  they  have NO data on it. Even the
> previously existing bacula labels are gone.

I suspect that this is because the OS has been so totally overloaded by data 
that it gets totally lost, which in turn confuses Bacula as it attempts to 
salvage what it can from the backup.

> 
> 
> 
> I can now reproduce this failure mode; for example, I try to  recycle
> the  last  corrupted tape (which has no data on it anyway): I unmount
> it, delete it from the catalog, label it again  (using  the  same  or
> another  volume  name),  mount  it again. The hanging backup job will
> attempt to write on it again, and corrupt it again:

At this point, everything is more or less screwed up for that particular job.  
The file descriptor that the job holds for the tape is probably no longer 
valid, and as you see below, it continues to immediately get the "Device or 
resource busy" error.  Thus is is better to cancel the job, and if that does 
not resolve the problem, to restart the Storage daemon, because things have 
gone terribly wrong.

> 
> 18-Aug 01:14 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 01:14 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace 
record at EOT failed. ERR=Input/output error
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error 
on "DDS-3" (/dev/nst0). ERR=Input/output error.
> 
> 18-Aug 01:14 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 01:14 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 01:14 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device 
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 01:38 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 01:38 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace 
record at EOT failed. ERR=Input/output error
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error 
on "DDS-3" (/dev/nst0). ERR=Input/output error.
> 
> 18-Aug 01:38 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 01:38 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 01:38 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device 
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 01:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device 
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 03:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device 
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 07:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device 
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 10:19 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 10:19 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace 
record at EOT failed. ERR=Input/output error
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error 
on "DDS-3" (/dev/nst0). ERR=Input/output error.
> 
> 18-Aug 10:19 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 10:19 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 10:19 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device 
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 10:29 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 10:29 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 
Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource 
busy.
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace 
record at EOT failed. ERR=Input/output error
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 
Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error 
on "DDS-3" (/dev/nst0). ERR=Input/output error.
> 
> 18-Aug 10:29 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 10:29 nyx-sd: Please mount Volume "K-V-DAT-1" on Storage Device 
"DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 
> 
> To me it seems as if bacula somehwat gets confused when several  jobs
> are  despooling  data  in  parallel,  and  a  medium gets full and is
> changed.
> 
> 
> All this is with Bacula 1.38.11 (28Jun06).
> 
> 
> Any ideas?

I think that it is the OS that is getting confused or overloaded.  It returns 
an error status that should never happen, and from there on Bacula is unable 
to recover.

> 
> 
> Best regards,
> 
> Wolfgang Denk
> 
> -- 
> Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
> Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: [EMAIL PROTECTED]
> When the ax entered the forest, the trees said, "The handle is one of
> us!"                                               -- Turkish proverb
> 
> -------------------------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job 
easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> _______________________________________________
> Bacula-users mailing list
> Bacula-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-users
> 

-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Reply via email to