Hi,

09.11.2007 10:19, [EMAIL PROTECTED] wrote:
> Good morning,
> we have bacula 1.38 running on some Debian/Linux 4.0 servers.

Upgrading to 2.2.5 might be reasonable.

> We use sqlite3 as bacula catalog. Director (dir) and Storage Daemon (sd)
> are on the same server.
> Until recently, everything was running perfectly. Suddenly one of the
> backups fails with messages like this:
>
> 05-Nov 01:15 dir: Start Backup JobId 3275, Job=nvpop01.2007-11-05_01.03.11
> 05-Nov 01:15 sd: Spooling data ...
> 05-Nov 01:20 sd: User specified spool size reached.
> 05-Nov 01:20 sd: Writing spooled data to Volume. Despooling 2,000,050,353
> bytes ...
> 05-Nov 01:21 sd: Spooling data again ...
> 05-Nov 01:25 sd: User specified spool size reached.
> 05-Nov 01:25 sd: Writing spooled data to Volume. Despooling 2,000,050,362
> bytes ...
> 05-Nov 01:26 sd: Spooling data again ...
> 05-Nov 01:30 sd: User specified spool size reached.
> 05-Nov 01:30 sd: Writing spooled data to Volume. Despooling 2,000,050,334
> bytes ...
> 05-Nov 01:31 sd: Spooling data again ...
> 05-Nov 01:33 sd: Committing spooled data to Volume "UTw0001". Despooling
> 1,408,516,846 bytes ...
> 05-Nov 01:34 sd: Sending spooled attrs to the Director. Despooling 17,901,905
> bytes ...
> 05-Nov 03:15 dir: nvpop01.2007-11-05_01.03.11 Fatal error: Network error with
> FD during Backup: ERR=Connection reset by
> peer

Probably the network devices time out the seemingly idle connection
between DIR and FD. The heartbeat settings might help. Ensuring that your
network equipment (routers and firewalls especially, but also the IP
stacks) uses the TCP keepalive option correctly might be better, but is
sometimes impossible.

> 05-Nov 03:15 dir: nvpop01.2007-11-05_01.03.11 Fatal error: No Job status
> returned from FD.
> 05-Nov 03:15 dir: nvpop01.2007-11-05_01.03.11 Error: Bacula 1.38.11 (28Jun06):
> 05-Nov-2007 03:15:44
>   JobId:                  3275
>   Job:                    nvpop01.2007-11-05_01.03.11
>   Backup Level:           Full
>   Client:                 "nvpop01-fd" i486-pc-linux-gnu,debian,4.0
>   FileSet:                "nvpop01FS" 2007-06-01 16:54:46
>   Pool:                   "UTweek"
>   Storage:                "sd"
>   Scheduled time:         05-Nov-2007 01:03:10
>   Start time:             05-Nov-2007 01:15:44
>   End time:               05-Nov-2007 03:15:44
>   Elapsed time:           2 hours
>   Priority:               10
>   FD Files Written:       0
>   SD Files Written:       49,024
>   FD Bytes Written:       0 (0 B)
>   SD Bytes Written:       7,399,999,725 (7.399 GB)
>   Rate:                   0.0 KB/s
>   Software Compression:   None
>   Volume name(s):         UTw0001
>   Volume Session Id:      6
>   Volume Session Time:    1194198162
>   Last Volume Bytes:      190,824,602,912 (190.8 GB)
>   Non-fatal FD errors:    0
>   SD Errors:              0
>   FD termination status:  Error
>   SD termination status:  OK
>   Termination:            *** Backup Error ***
>
> The first thing that I noticed is that despooling attributes takes ages
> (more than data backup).

Use a better-performing catalog database; SQLite is really not the best
choice for larger databases.

> In order to understand what's going on, I
> created a fake directory tree with 50K empty directories. With this setup
> I have little data to store but about 8MB of attributes to save (which
> is about half of the real backup that's troubling us).
>
> I can reproduce both the long attribute despooling time and the error. I
> tried to add Heartbeat interval but the Director and the Storage daemon
> config files don't seem to like this option (I have a Bacula 2.0 manual,
> which states that I can put that option almost everywhere). The File
> Daemon instead liked it, but it didn't make any difference. The backup
> still fails.

I'm not sure when heartbeat was introduced, but I was fairly sure 1.38
supported it for DIR, SD and FD. I might be wrong, of course... Can you
show us where SD and DIR don't like the directive?

> I see that the list of files is being sent to the catalog (if I list them
> with list files jobid=nnnn), but according to the mail report the backup
> failed.
>
> All other backups are running fine, but none of them has the same amount
> of attribute data. The same backup job runs fine if I set level to
> incremental. The amount of incremental attributes is 1.3MB, and it takes
> 9 minutes to despool them. So I know that after 9 minutes the FD is
> still there. I have set the heartbeat interval to 60 seconds, but as I
> said, to no avail.
>
> I think that the problem might be that despooling attributes takes too
> long and the FD closes the connection before the director comes back to ask
> for job status, but I don't know how to keep the FD waiting.

It's not the FD itself; the connection is probably closed by some
equipment in between. The default TCP keepalive time is in fact two
hours, but there are means to handle this: generating application-level
traffic (the heartbeat in Bacula) or using TCP keepalive options (which do
not work with all OSes and equipment, as far as I know).

> Did anybody experience this problem? How did he/she fix it?

Heartbeat interval or network equipment reconfiguration.

Arno

> Thank you very much
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems? Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> Bacula-users mailing list
> Bacula-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-users
>

--
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de
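
P.S. To illustrate what "using TCP keepalive options" means at the socket level, here is a minimal Python sketch. It is not Bacula code and says nothing about how Bacula sets up its own sockets. The helper name enable_keepalive and the timer values (60 s idle, 60 s between probes, 5 probes) are assumptions picked for the example. The TCP_KEEP* constants are platform-specific (Linux has them), so the sketch simply skips any that are missing.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=60, probes=5):
    """Enable TCP keepalive on sock and, where supported, shorten its timers.

    Hypothetical helper for illustration only; the default values are
    assumptions, not recommendations from the Bacula documentation.
    """
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timer overrides; not every OS exposes all three constants.
    tunables = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tunables:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With an idle time well below two hours, probes should reach any stateful firewall in between often enough to keep its connection entry alive, provided the equipment actually honours keepalive packets.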