Hi.
I'm using Bacula to back up a large data set, about 100 TB. It usually
takes about 15-16 days.
I've run into a problem. As I understand it, Bacula has a mechanism that
monitors jobs (a watchdog timer), and this mechanism is causing me
trouble. My job is terminated after 6 days with this error message:
2013-12-29 16:42:56baculasrv-dir JobId 8013: Error: Watchdog sending
kill after 518427 secs to thread stalled reading File daemon.
2013-12-29 16:42:56baculasrv-dir JobId 8013: Fatal error: Network error
with FD during Backup: ERR=Interrupted system call
2013-12-29 16:42:57baculasrv-sd JobId 8013: Elapsed time=143:47:09,
Transfer rate=58.09 M Bytes/second
2013-12-29 16:42:57baculasrv-dir JobId 8013: Error: Director's comm line
to SD dropped.
2013-12-29 16:42:57baculasrv-dir JobId 8013: Fatal error: No Job status
returned from FD.
2013-12-29 16:42:57baculasrv-dir JobId 8013: Error: Bacula baculasrv-dir
5.2.13 (19Jan13):
But my job is still active. Where is the problem? Is the FD not sending
"keep-alive" packets, or is 6 days a hardcoded limit on the maximum
running time?
In the sources I see this (src/lib/bnet.c):
/*
* ****FIXME**** reduce this to a few hours once
* heartbeats are implemented
*/
bsock->timeout = 60 * 60 * 6 * 24; /* 6 days timeout */
Does this mean that heartbeats aren't implemented yet?
For now I'm changing that interval to 30 days.
Is there a more elegant way?
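
To double-check the numbers, here is a small standalone sketch of the
arithmetic. It is not Bacula code: the helper names days_to_secs and
secs_past_timeout and the SECS_PER_DAY macro are made up for illustration.
It only shows that 60 * 60 * 6 * 24 is 518400 seconds (exactly 6 days),
which is 27 seconds less than the 518427 seconds reported by the watchdog,
and what a 30-day value would be in seconds.

```c
#include <assert.h>
#include <stdio.h>

#define SECS_PER_DAY (24L * 60L * 60L)

/* Hypothetical helper (not a Bacula function): whole days to seconds. */
long days_to_secs(long days)
{
    assert(days >= 0);
    return days * SECS_PER_DAY;
}

/* Hypothetical helper: how many seconds past the timeout the kill came. */
long secs_past_timeout(long observed_secs, long timeout_secs)
{
    return observed_secs - timeout_secs;
}

/* Print the comparison for the values quoted in this message. */
void print_timeout_summary(void)
{
    long current  = 60L * 60 * 6 * 24;  /* expression from src/lib/bnet.c */
    long observed = 518427L;            /* from the Director log above */

    printf("current timeout : %ld secs\n", current);
    printf("kill came after : %ld secs past the timeout\n",
           secs_past_timeout(observed, current));
    printf("30-day timeout  : %ld secs\n", days_to_secs(30));
}
```

Calling print_timeout_summary() from a main() prints the three values. The
small gap between the timeout and the kill is consistent with the watchdog
firing on the 6-day bsock timeout rather than on some other limit.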
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel