Hi,
this year I switched our backup server to Ubuntu 12.04.
For the first month everything was ok until the last Ubuntu update.
The backup structure is as follows:
On the storage server, a huge disk storage is attached. Here only the file
daemon is running.
On the backup server the director and the storage daemons are running.
Once a week the complete files from the storage server are written onto tape.
Two jobs (two tape drives) with a delay, but they might run simultaneously for
some time.
OS is in both cases Ubuntu 12.04 64 bit.
Kernel: 3.2.0-27
Bacula taken from the Ubuntu packages: 5.2.5-0ubuntu6.1
As the amount of data is rather big, the jobs might request a second tape from
their pool. We don't use an auto-changer, so I need to replace the tape on
Monday morning. Then the job continues.
That way it worked for about 3 months until the latest update, which contained
an update of the bacula packages.
The update was done on both machines.
Now the following error occurs:
25-Aug 09:08 ttl010-dir JobId 31: Start Backup JobId 31,
Job=backup4.2012-08-25_09.08.00_15
25-Aug 09:08 ttl010-dir JobId 31: Using Device "Drive-1"
25-Aug 09:08 ttl010-sd JobId 31: Recycled volume "PB4T1" on device "Drive-1"
(/dev/nst0), all previous data lost.
25-Aug 09:08 ttl010-dir JobId 31: Volume used once. Marking Volume "PB4T1" as
Used.
25-Aug 16:17 ttl010-sd JobId 31: End of Volume "PB4T1" at 1152:13356 on device
"Drive-1" (/dev/nst0). Write of 64512 bytes got -1.
25-Aug 16:17 ttl010-sd JobId 31: Re-read of last block succeeded.
25-Aug 16:17 ttl010-sd JobId 31: End of medium on Volume "PB4T1"
Bytes=1,152,787,894,272 Blocks=17,869,355 at 25-Aug-2012 16:17.
25-Aug 16:18 ttl010-sd JobId 31: Job backup4.2012-08-25_09.08.00_15 is waiting.
Cannot find any appendable volumes.
Please use the "label" command to create a new Volume for:
Storage: "Drive-1" (/dev/nst0)
Pool: Pool-backup4
Media type: LTO-4
25-Aug 16:33 ttl011-fd JobId 31: Error: bsock.c:389 Write error sending 65536
bytes to Storage daemon:160.220.129.201:9103: ERR=Connection timed out
25-Aug 16:33 ttl011-fd JobId 31: Fatal error: backup.c:1190 Network send error
to SD. ERR=Connection timed out
25-Aug 16:33 ttl010-sd JobId 31: Error: bsock.c:389 Write error sending -6
bytes to client:160.220.129.203:36643: ERR=Connection reset by peer
25-Aug 16:33 ttl010-dir JobId 31: Error: Bacula ttl010-dir 5.2.5 (26Jan12):
Build OS: x86_64-pc-linux-gnu ubuntu 12.04
JobId: 31
Job: backup4.2012-08-25_09.08.00_15
Backup Level: Full
Client: "ttl011-fd" 5.2.5 (26Jan12)
x86_64-pc-linux-gnu,ubuntu,12.04
FileSet: "backup" 2012-05-14 19:58:03
Pool: "Pool-backup4" (From Job resource)
Catalog: "MyCatalog" (From Client resource)
Storage: "Tape1" (From Job resource)
Scheduled time: 25-Aug-2012 09:08:00
Start time: 25-Aug-2012 09:08:03
End time: 25-Aug-2012 16:33:27
Elapsed time: 7 hours 25 mins 24 secs
Priority: 10
FD Files Written: 252,220
SD Files Written: 0
FD Bytes Written: 1,151,889,284,025 (1.151 TB)
SD Bytes Written: 0 (0 B)
Rate: 43103.2 KB/s
Software Compression: None
VSS: no
Encryption: no
Accurate: no
Volume name(s): PB4T1
Volume Session Id: 2
Volume Session Time: 1345456088
Last Volume Bytes: 0 (0 B)
Non-fatal FD errors: 1
SD Errors: 0
FD termination status: Error
SD termination status: Wait for new Volume
Termination: *** Backup Error ***
After this error occurred the first time I modified my file and storage daemons
to use a heartbeat. But the problem is still present.
Dmesg gives the following:
[1701840.528104] INFO: task bacula-sd:25961 blocked for more than 120 seconds.
[1701840.528113] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[1701840.528120] bacula-sd D ffffffff81806080 0 25961 1
0x00000000
[1701840.528130] ffff880118dc7bf8 0000000000000082 ffff880118dc7ba8
ffff880118dc7ba8
[1701840.528141] ffff880118dc7fd8 ffff880118dc7fd8 ffff880118dc7fd8
0000000000013780
[1701840.528151] ffffffff81c0d020 ffff880115f22de0 ffff880118dc7bd8
7fffffffffffffff
[1701840.528160] Call Trace:
[1701840.528178] [<ffffffff81657d8f>] schedule+0x3f/0x60
[1701840.528188] [<ffffffff816583d5>] schedule_timeout+0x2a5/0x320
[1701840.528199] [<ffffffff8130d117>] ? kobject_put+0x27/0x60
[1701840.528208] [<ffffffff81659f85>] ? _raw_spin_lock_irq+0x15/0x20
[1701840.528217] [<ffffffff81657bcf>] wait_for_common+0xdf/0x180
[1701840.528227] [<ffffffff8105fae0>] ? try_to_wake_up+0x200/0x200
[1701840.528235] [<ffffffff81657d4d>] wait_for_completion+0x1d/0x20
[1701840.528256] [<ffffffffa0186b1a>] st_do_scsi.constprop.17+0x12a/0x280 [st]
[1701840.528267] [<ffffffffa018c033>] do_load_unload.part.13+0x98/0x12f [st]
[1701840.528278] [<ffffffffa018b04a>] st_ioctl+0xa6a/0xb80 [st]
[1701840.528288] [<ffffffff8118a16a>] do_vfs_ioctl+0x8a/0x340
[1701840.528296] [<ffffffff8117823d>] ? vfs_read+0x10d/0x180
[1701840.528303] [<ffffffff8118a4b1>] sys_ioctl+0x91/0xa0
[1701840.528311] [<ffffffff81662282>] system_call_fastpath+0x16/0x1b
Any suggestions what is going wrong?
Thank you in advance.
Regards,
Markus
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users