Hi Wanderlei,

i thought that too before i found that error message in syslog. Before that, i tried:
 - setting Heartbeat Interval in Director and Storage daemon to 1000 minutes
 AND
- setting TCP keepalive in /proc/sys/net/ipv4/tcp_keepalive_time to 100000 on Director and SD
(after that i restasted both Dir and SD)


I was also sniffing all communication between Director and SD to see if some device between then (switch?) is not resetting the connection - and no, connection was correctly closed (TCP FIN) by SD.

Anyway, message in syslog is clear enought to know that problem wasn't in closed connection. There's probably some kind of a bug in SD which caused SD to not send that error to Director so it can be logged.





Citát Wanderlei Huttel <wanderleihut...@gmail.com>:

Hello Azur

It looks that problem was an error on the connection:



*11-jan 05:42 server00-dir JobId 56875: Error: Director's connection to  SD
for this Job was lost.  *

You could try enable "Heartbeat Interval in your config files:
https://github.com/wanderleihuttel/bacula-utils/blob/master/dicas/heartbeat_interval.md

Best regards

*Wanderlei Hüttel*
http://www.bacula.com.br


Em sex, 11 de jan de 2019 às 06:08, <azu...@pobox.sk> escreveu:

Hi all,

i'm having problems with virtual full backup on one of our clients.
Everything was working fine for years but, suddenly, Bacula is unable
to complete the job - i was trying to run it about 10 times during
past few days, all of them ends with the same weird error.

The error is occuring near of the end of the job, after 240 volumes
were created (we have a limit of 5 GB for volume size) and 1,2 TB of
data was copied.

Here are last lines from job report:

11-jan 05:40 server08-sd JobId 56875: Ready to read from volume
"cloud0015-16052" on File device "cloud0015-file-storage"
(/backup/cloud0015).
11-jan 05:40 server08-sd JobId 56875: Forward spacing Volume
"cloud0015-16052" to addr=226
11-jan 05:40 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17313" as Used.
11-jan 05:40 server08-sd JobId 56875: End of medium on Volume
"cloud0015-17313" Bytes=5,368,688,735 Blocks=83,220 at 11-Jan-2019
05:40.
11-jan 05:40 server00-dir JobId 56875: Created new
Volume="cloud0015-17314", Pool="cloud0015-pool",
MediaType="File-cloud0015" in catalog.
11-jan 05:40 server08-sd JobId 56875: Labeled new Volume
"cloud0015-17314" on File device "cloud0015-file-storage-2"
(/backup/cloud0015).
11-jan 05:40 server08-sd JobId 56875: Wrote label to prelabeled Volume
"cloud0015-17314" on File device "cloud0015-file-storage-2"
(/backup/cloud0015)
11-jan 05:40 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17314" as Used.
11-jan 05:40 server08-sd JobId 56875: New volume "cloud0015-17314"
mounted on device "cloud0015-file-storage-2" (/backup/cloud0015) at
11-Jan-2019 05:40.
11-jan 05:40 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17314" as Used.
11-jan 05:40 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17314" as Used.
11-jan 05:41 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17314" as Used.
11-jan 05:41 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17314" as Used.
11-jan 05:41 server08-sd JobId 56875: End of Volume "cloud0015-16052"
at addr=5368688648 on device "cloud0015-file-storage"
(/backup/cloud0015).
11-jan 05:41 server08-sd JobId 56875: Ready to read from volume
"cloud0015-16053" on File device "cloud0015-file-storage"
(/backup/cloud0015).
11-jan 05:41 server08-sd JobId 56875: Forward spacing Volume
"cloud0015-16053" to addr=226
11-jan 05:41 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17314" as Used.
11-jan 05:41 server08-sd JobId 56875: End of medium on Volume
"cloud0015-17314" Bytes=5,368,688,678 Blocks=83,220 at 11-Jan-2019
05:41.
11-jan 05:41 server00-dir JobId 56875: Created new
Volume="cloud0015-17315", Pool="cloud0015-pool",
MediaType="File-cloud0015" in catalog.
11-jan 05:41 server08-sd JobId 56875: Labeled new Volume
"cloud0015-17315" on File device "cloud0015-file-storage-2"
(/backup/cloud0015).
11-jan 05:41 server08-sd JobId 56875: Wrote label to prelabeled Volume
"cloud0015-17315" on File device "cloud0015-file-storage-2"
(/backup/cloud0015)
11-jan 05:41 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 05:41 server08-sd JobId 56875: New volume "cloud0015-17315"
mounted on device "cloud0015-file-storage-2" (/backup/cloud0015) at
11-Jan-2019 05:41.
11-jan 05:41 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 05:42 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 05:42 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 05:42 server00-dir JobId 56875: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 05:42 server00-dir JobId 56875: Error: Director's connection to
SD for this Job was lost.





This is what is logged to syslog on SD side:

Jan 11 05:42:43 server08 bacula-sd[13585]: server08-sd:
vbackup.c:130-56875 Fatal append error on device
"cloud0015-file-storage-2" (/backup/cloud0015): ERR=file_dev.c:190
Could not
open(/backup/cloud0015/cloud0015-17315,OPEN_READ_WRITE,0640): ERR=No
such file or directory








There's no reason why the volume wasn't created. Permissions are ok:

# ls -ld /backup/cloud0015/
drwxr-x--- 2 bacula tape 36864 Jan 11 05:41 /backup/cloud0015/


Free disk space is sufficient:

# df -h /backup/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1       7.3T  5.1T  2.2T  70% /backup







Another strange thing is that while virtual full was running, new
incremental backup was queue for that client. After full failed,
incremetnal was run AND completed fine while using THAT ONE ERRONEOUS
VOLUME:

11-jan 05:42 server00-dir JobId 56897: Start Backup JobId 56897,
Job=cloud0015.2019-01-11_01.00.17_25
11-jan 05:42 server00-dir JobId 56897: There are no more Jobs
associated with Volume "cloud0015-17315". Marking it purged.
11-jan 05:42 server00-dir JobId 56897: All records pruned from Volume
"cloud0015-17315"; marking it "Purged"
11-jan 05:42 server00-dir JobId 56897: Recycled volume "cloud0015-17315"
11-jan 05:42 server00-dir JobId 56897: Using Device
"cloud0015-file-storage" to write.
11-jan 05:42 cloud0015-fd JobId 56897: shell command: run
ClientRunBeforeJob "/var/www/backends/webbackend/misc/mysql_backup.py"
11-jan 05:42 server00-dir JobId 56897: Sending Accurate information to the
FD.
11-jan 05:52 server08-sd JobId 56897: Recycled volume
"cloud0015-17315" on File device "cloud0015-file-storage"
(/backup/cloud0015), all previous data lost.
11-jan 05:52 server00-dir JobId 56897: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 05:58 server00-dir JobId 56897: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 06:03 server00-dir JobId 56897: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 06:06 server00-dir JobId 56897: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 06:13 server00-dir JobId 56897: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 06:16 server08-sd JobId 56897: Elapsed time=00:24:42, Transfer
rate=3.260 M Bytes/second
11-jan 06:16 server00-dir JobId 56897: Volume used once. Marking
Volume "cloud0015-17315" as Used.
11-jan 06:16 server08-sd JobId 56897: Sending spooled attrs to the
Director. Despooling 10,006,797 bytes ...




This starts to happen on Bacula 7.4.4 but after few days i decided to
upgrade to 9.2.1, problem persists. Debian linux, 64bit, i'm using
original Debian packages of Bacula.

I also checked the health of HDD where backups are stored, done short
and long SMART tests, file system checks etc. No errors at all.

Any hints? Thanks

azur




_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users






_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Reply via email to